The purpose of this research is to explore the factors that define a game’s success: genre, price, in-app purchases, description, the languages the game is offered in, its developer, and the age rating that determines who is permitted to play. By analyzing these attributes, we can better understand why some games succeed where others do not. The dataset we analyze, “17K Mobile Strategy Games”, is made up of all strategy games from the Apple App Store. We hypothesize that free games with a wide audience and eye-catching descriptions draw more users and are therefore more successful, where success is defined as having both a high user count and high user ratings. This rests on the assumption that free games attract more users and so have a better chance of succeeding. The results of this research may help game developers prioritize the attributes that matter most for their games’ success and popularity worldwide.
Data exploration and visualization
1. Which genre is the most popular?
2. Which words are most commonly used in the description of games?
3. Does the genre of a game cause people to spend more money?
4. What are the top languages in which games are offered?
5. What is the distribution of user rating across genre?
6. Which genre of game does better internationally?
7. What is the relationship between initial price of apps and average user rating?
8. What is the average price of in-app purchases?
9. Is there a relationship between user rating and in-app purchases? And does the amount of available in-app purchases decrease rating?
10. What information can we find about game developers and their strategy games?
11. What is the frequency of the age groups?
12. How has the size of the applications of the top 3 primary genres changed over a span of about 11 years?
Data analysis, modeling and/or predictions
13. What contributes to a game’s success?
14. Can we predict if an app is free or not?
15. What primary genre is similar to the “Games” genre?
To start this analysis we first want to clean the original dataset:
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
##
## -- Column specification --------------------------------------------------------
## cols(
## URL = col_character(),
## ID = col_double(),
## Name = col_character(),
## Subtitle = col_character(),
## `Icon URL` = col_character(),
## `Average User Rating` = col_double(),
## `User Rating Count` = col_double(),
## Price = col_double(),
## `In-app Purchases` = col_character(),
## Description = col_character(),
## Developer = col_character(),
## `Age Rating` = col_character(),
## Languages = col_character(),
## Size = col_double(),
## `Primary Genre` = col_character(),
## Genres = col_character(),
## `Original Release Date` = col_character(),
## `Current Version Release Date` = col_character()
## )
## # A tibble: 6 x 15
## Name `Icon URL` `Average User R~ `User Rating Co~ Price `In-app Purchas~
## <chr> <chr> <dbl> <dbl> <dbl> <chr>
## 1 Sudoku https://is2-~ 4 3553 2.99 <NA>
## 2 Reversi https://is4-~ 3.5 284 1.99 <NA>
## 3 Morocco https://is5-~ 3 8376 0 <NA>
## 4 Sudoku~ https://is3-~ 3.5 190394 0 <NA>
## 5 Senet ~ https://is1-~ 3.5 28 2.99 <NA>
## 6 Sudoku~ https://is1-~ 3 47 0 1.99
## # ... with 9 more variables: Description <chr>, Developer <chr>,
## # Age Rating <chr>, Languages <chr>, Size <dbl>, Primary Genre <chr>,
## # Genres <chr>, Original Release Date <date>,
## # Current Version Release Date <date>
Separating Data and Renaming Variables:
## # A tibble: 7,488 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 2.99 4 Price Average_User_Rating
## 2 1.99 3.5 Price Average_User_Rating
## 3 0 3 Price Average_User_Rating
## 4 0 3.5 Price Average_User_Rating
## 5 2.99 3.5 Price Average_User_Rating
## 6 0 3 Price Average_User_Rating
## 7 0 2.5 Price Average_User_Rating
## 8 0.99 2.5 Price Average_User_Rating
## 9 0 2.5 Price Average_User_Rating
## 10 0 2.5 Price Average_User_Rating
## # ... with 7,478 more rows
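The two constant columns named `"Price"` and `"Average_User_Rating"` in the output above are what dplyr typically produces when `group_by()` is given quoted strings instead of bare column names. The original code is not shown, so the following is only a minimal sketch of one way to select and rename the two columns without creating them. The `games` tibble is a hypothetical stand-in for the cleaned dataset.

```r
library(dplyr)

# Hypothetical stand-in for the cleaned dataset (first rows of the table above).
games <- tibble(
  Name = c("Sudoku", "Reversi", "Morocco"),
  `Average User Rating` = c(4, 3.5, 3),
  Price = c(2.99, 1.99, 0)
)

# select() can rename while selecting, so no group_by() is needed here.
clean_data <- games %>%
  select(Price, AUR = `Average User Rating`)

print(clean_data)
```

If grouping is actually wanted, bare names such as `group_by(Price)` refer to the existing column rather than creating a new one.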
Most Popular Genres
## # A tibble: 6 x 2
## `Primary Genre` n
## <chr> <int>
## 1 Games 7220
## 2 Entertainment 92
## 3 Education 46
## 4 Utilities 44
## 5 Sports 23
## 6 Reference 18
Which application genre is the most popular, that is, which genre do developers publish the most apps in? Companies could use this information to identify an over-saturated genre and move into a less crowded one, or simply to follow what is popular.
From the graph, the two most common primary genres are Games and Entertainment, and the least common is Music. Game applications are probably the most common because of the creative range they allow, their popularity, and their profitability from ads. Music apps might be less common because the barrier to entry is high: they need a lot of storage, and obtaining licences can be expensive and difficult.
After exploring which genre is most popular, we examined whether the genre of a game had any influence on the amount of money spent on the app (purchase price) or in the app (in-app purchases).
Cleaning and separating data
## # A tibble: 7,488 x 5
## # Groups: "Genres" [1]
## Name Price InApp Genres `"Genres"`
## <chr> <dbl> <chr> <chr> <chr>
## 1 "Sudoku" 2.99 <NA> Games, Strategy, Puzzle Genres
## 2 "Reversi" 1.99 <NA> Games, Strategy, Board Genres
## 3 "Morocco" 0 <NA> Games, Board, Strategy Genres
## 4 "Sudoku (Free)" 0 <NA> Games, Strategy, Puzzle Genres
## 5 "Senet Deluxe" 2.99 <NA> Games, Strategy, Board, Edu~ Genres
## 6 "Sudoku - Classic number~ 0 1.99 Games, Entertainment, Strat~ Genres
## 7 "Gravitation" 0 <NA> Games, Entertainment, Puzzl~ Genres
## 8 "Colony" 0.99 <NA> Games, Strategy, Board Genres
## 9 "Carte" 0 <NA> Games, Strategy, Board, Ent~ Genres
## 10 "\"Barrels O' Fun\"" 0 <NA> Games, Casual, Strategy Genres
## # ... with 7,478 more rows
Because the genre and in-app purchases columns hold multiple values separated by commas, the separate_rows() function is used to split each individual element onto a new row.
## # A tibble: 48 x 3
## Genres avgPrice avgInApp
## <chr> <dbl> <dbl>
## 1 News 0.0762 23.9
## 2 Networking 0.0490 21.0
## 3 Social 0.0490 21.0
## 4 Medical 0.707 20.0
## 5 Business 0.318 17.6
## 6 Playing 0.250 17.6
## 7 Role 0.250 17.6
## 8 Card 0.339 12.8
## 9 Action 0.285 12.0
## 10 Simulation 0.382 11.9
## # ... with 38 more rows
## # A tibble: 48 x 4
## Genres avgPrice avgInApp totalavg
## <chr> <dbl> <dbl> <dbl>
## 1 Weather 9.99 NaN 9.99
## 2 Finance 3.97 11.3 15.3
## 3 Reference 3.65 5.74 9.38
## 4 Board 1.04 6.28 7.32
## 5 Education 0.714 5.31 6.02
## 6 Medical 0.707 20.0 20.7
## 7 Productivity 0.638 9.88 10.5
## 8 Emoji 0.495 NaN 0.495
## 9 Expressions 0.495 NaN 0.495
## 10 Utilities 0.389 3.08 3.47
## # ... with 38 more rows
## # A tibble: 48 x 4
## Genres avgPrice avgInApp totalavg
## <chr> <dbl> <dbl> <dbl>
## 1 News 0.0762 23.9 23.9
## 2 Networking 0.0490 21.0 21.1
## 3 Social 0.0490 21.0 21.1
## 4 Medical 0.707 20.0 20.7
## 5 Business 0.318 17.6 18.0
## 6 Playing 0.250 17.6 17.9
## 7 Role 0.250 17.6 17.9
## 8 Finance 3.97 11.3 15.3
## 9 Card 0.339 12.8 13.1
## 10 Simulation 0.382 11.9 12.3
## # ... with 38 more rows
Once the values are separated into new rows, the summarize() function is used to calculate the average purchase price of the app itself, the average in-app purchase price, and the total of the two averages.
Judging from the plot and the preceding tables, Weather has the highest average upfront cost, at $9.99. News apps, by contrast, cost little upfront, but their in-app purchases average $23.87. The three genres with the highest total average spend are News, Networking, and Social. (Networking and Social have identical averages, which suggests they are the two halves of a single “Social Networking” genre label.)
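The split-then-summarise steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' code. The `apps` tibble and its values are hypothetical, and a missing in-app average is treated as 0 when forming `totalavg`, which is what the Weather row in the tables suggests.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in rows; the real data has one row per app.
apps <- tibble(
  Name   = c("A", "B", "C"),
  Price  = c(2.99, 0, 0),
  InApp  = c(NA, "1.99, 0.99", "4.99"),
  Genres = c("Games, Board", "Games, Puzzle", "Games, Strategy")
)

by_genre <- apps %>%
  separate_rows(Genres, sep = ",\\s*") %>%  # one row per app/genre pair
  separate_rows(InApp, sep = ",\\s*") %>%   # one row per in-app price
  mutate(InApp = as.numeric(InApp)) %>%
  group_by(Genres) %>%
  summarise(
    avgPrice = mean(Price),                 # note: weighted by in-app rows
    avgInApp = mean(InApp, na.rm = TRUE),   # NaN when a genre has no in-app prices
    .groups  = "drop"
  ) %>%
  mutate(totalavg = avgPrice + ifelse(is.na(avgInApp), 0, avgInApp)) %>%
  arrange(desc(totalavg))

print(by_genre)
```

One design point worth noting: because each app is repeated once per in-app price, `avgPrice` here is weighted by the number of in-app purchases. Averaging prices before expanding the in-app column would weight each app equally instead.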
A game’s description is as important as the hook in an essay. Just as the hook draws in the reader, the description attracts users to the game, which is key to its success. If no one is finding your game, the description may not be captivating its audience. To determine which words were most often used to describe games, we split each description into individual words and counted the frequency of each, removing articles and other non-descriptive words such as “a, the, an, it, this, be”. We looked primarily for adjectives, words that could describe what made a game different or special compared to others. The word cloud depicts some of the top words used in game descriptions.
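The word-counting idea can be illustrated with a short base R sketch. The descriptions and the stop-word list below are hypothetical and far smaller than anything used in the real analysis; the report's own output suggests a join against a stop-word table was used instead.

```r
# Hypothetical descriptions standing in for the Description column.
descriptions <- c(
  "Join the battle and play the best strategy game",
  "A new world of strategy: play this game free",
  "The game is free to play"
)

# A tiny illustrative stop-word list; a real analysis would use a fuller one.
stop_list <- c("a", "an", "the", "it", "this", "be", "is", "and", "of", "to")

# Lower-case, split on anything that is not a letter, and drop empty strings.
words <- unlist(strsplit(tolower(descriptions), "[^a-z]+"))
words <- words[nchar(words) > 0 & !(words %in% stop_list)]

# Count occurrences and sort from most to least frequent.
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 5))
```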
Cleaning Description and Creating Wordcloud
## # A tibble: 6 x 2
## Name word
## <chr> <chr>
## 1 Sudoku join
## 2 Sudoku over
## 3 Sudoku of
## 4 Sudoku our
## 5 Sudoku fans
## 6 Sudoku and
## Joining, by = "word"
## # A tibble: 6 x 2
## Name word
## <chr> <chr>
## 1 Sudoku join
## 2 Sudoku fans
## 3 Sudoku download
## 4 Sudoku one
## 5 Sudoku sudoku
## 6 Sudoku game
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 game 21505
## 2 play 6732
## 3 new 5678
## 4 world 4064
## 5 players 3616
## 6 strategy 3457
## 7 time 3434
## 8 free 3410
## 9 battle 3356
## 10 levels 3099
Looking at the word cloud, we can see that “game” is used most often, followed by “play”, “new”, “world”, “players”, and “levels”.
Top 10 Game Descriptors, excluding ambiguous phrases/words and repeated plural versions of the same word:
Frequency Bar Chart of Languages
Cleaning the Language Column
## # A tibble: 6 x 2
## Name Languages
## <chr> <chr>
## 1 Sudoku DA
## 2 Sudoku NL
## 3 Sudoku EN
## 4 Sudoku FI
## 5 Sudoku FR
## 6 Sudoku DE
Finding Frequency of Top 15 Languages
## # A tibble: 6 x 3
## Languages total full_lang
## <chr> <int> <lgl>
## 1 EN 7429 NA
## 2 DE 1573 NA
## 3 ZH 1548 NA
## 4 FR 1519 NA
## 5 ES 1473 NA
## 6 JA 1354 NA
Import Full Lang, Clean Dataset, Create Loop to Put Full Name Instead of Abbrev
##
## -- Column specification --------------------------------------------------------
## cols(
## alpha2 = col_character(),
## English = col_character()
## )
## # A tibble: 10 x 2
## alpha2 English
## <chr> <chr>
## 1 br Breton
## 2 bs Bosnian
## 3 ca Catalan; Valencian
## 4 ce Chechen
## 5 ch Chamorro
## 6 co Corsican
## 7 cr Cree
## 8 cs Czech
## 9 cu Church Slavic; Old Slavonic; Church Slavonic; Old Bulgarian; Old Chur~
## 10 cv Chuvash
## # A tibble: 10 x 2
## alpha2 English
## <chr> <chr>
## 1 BR Breton
## 2 BS Bosnian
## 3 CA Catalan
## 4 CE Chechen
## 5 CH Chamorro
## 6 CO Corsican
## 7 CR Cree
## 8 CS Czech
## 9 CU Church Slavic
## 10 CV Chuvash
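As an alternative to a loop, the abbreviation-to-name lookup can be expressed as a join. The sketch below is an assumption about one way to do it, not the code behind the tables above. The names `lang_counts`, `lang_lookup` and `lang_clean` are hypothetical, and the lookup rows for English, German and Chinese are filled in from the ISO 639-1 codes rather than copied from the report.

```r
library(dplyr)

# Hypothetical stand-ins for the language counts and the imported lookup table.
lang_counts <- tibble(
  Languages = c("EN", "DE", "ZH"),
  total     = c(7429L, 1573L, 1548L)
)
lang_lookup <- tibble(
  alpha2  = c("en", "de", "zh", "ca"),
  English = c("English", "German", "Chinese", "Catalan; Valencian")
)

lang_clean <- lang_lookup %>%
  mutate(
    alpha2  = toupper(alpha2),          # match the upper-case codes in the app data
    English = sub(";.*$", "", English)  # keep only the first listed name
  )

top_langs <- lang_counts %>%
  left_join(lang_clean, by = c("Languages" = "alpha2")) %>%
  rename(full_lang = English)

print(top_langs)
```

A join leaves `full_lang` as `NA` for any code missing from the lookup table, which makes unmatched abbreviations easy to spot.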
Bar Plot of Top 15 Languages
We wanted to explore which languages occur most often in applications. As expected, the most common application language is English, followed by German, Chinese, and French. This is most likely because much of the App Store audience speaks English, so most apps include it.
Violin of Average User Rating Across Genre
## [1] "Games" "Entertainment" "Education" "Utilities"
## [5] "Sports" "Reference"
## # A tibble: 6 x 2
## `Average User Rating` `Primary Genre`
## <dbl> <chr>
## 1 4 Games
## 2 3.5 Games
## 3 3 Games
## 4 3.5 Games
## 5 3.5 Games
## 6 3 Games
Next, we looked at the average user rating across primary genres. The violin plot shows a lot of variability in each primary genre, with the exception of the book genre. A possible reason the book genre has few outliers is that it has far less data than, say, the games genre.
The graph also shows that the genres on the left half have no single high concentration of user ratings; the ratings are spread out. In the games genre, by contrast, ratings are clearly concentrated around 4.5.
After identifying the top languages in which games are offered, we then decided to delve into which genre of games did better internationally.
Cleaning and separating data
The separate_rows() function is applied to the languages and genres columns to place each individual language and genre, separated by commas in the original data, onto a new row.
## # A tibble: 6 x 3
## Name Languages Genres
## <chr> <chr> <chr>
## 1 Sudoku DA Games
## 2 Sudoku DA Strategy
## 3 Sudoku DA Puzzle
## 4 Sudoku NL Games
## 5 Sudoku NL Strategy
## 6 Sudoku NL Puzzle
Since we only want to look at international languages, English is excluded from this dataset. The data frame is then grouped by language and genre, and the total count of each language/genre pair is summarized.
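A minimal sketch of the filter-and-count step, assuming the data is already in long format with one row per app/language/genre combination. The `long` tibble is hypothetical.

```r
library(dplyr)

# Hypothetical long-format rows: one row per app/language/genre combination.
long <- tibble(
  Name      = c("A", "A", "A", "A", "B", "B"),
  Languages = c("EN", "EN", "DE", "DE", "DE", "ZH"),
  Genres    = c("Games", "Strategy", "Games", "Strategy", "Games", "Games")
)

intl <- long %>%
  filter(Languages != "EN") %>%                 # keep non-English languages only
  count(Languages, Genres, name = "total") %>%  # size of each language/genre pair
  arrange(desc(total))

print(intl)
```

`count()` returns an ungrouped result, whereas `group_by()` followed by `summarise()` leaves the output grouped by the first variable, as the message in the report's output indicates.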
## `summarise()` has grouped output by 'Languages'. You can override using the `.groups` argument.
## # A tibble: 1,389 x 3
## # Groups: Languages [112]
## Languages Genres total
## <chr> <chr> <int>
## 1 ZH Games 2712
## 2 ZH Strategy 2712
## 3 DE Games 1573
## 4 DE Strategy 1573
## 5 FR Games 1519
## 6 FR Strategy 1519
## 7 ES Games 1473
## 8 ES Strategy 1473
## 9 ZH Entertainment 1408
## 10 JA Games 1354
## # ... with 1,379 more rows
Based on this table, the most common language/genre pairs are ZH (Chinese) with the Games and Strategy genres, followed by DE (German) and FR (French) in 2nd and 3rd place, also with Games and Strategy.
Next, we looked at the relationship between age ratings and user ratings across primary genres. Where a genre has missing columns, as in finance, it means that no apps in that genre carry those age ratings; finance apps are generally available for all ages.
Looking at the books graph, you can see that book applications rated for teens and up have higher ratings than those for children. Ratings in the games genre are relatively similar across all age ratings, dropping off slightly for 17+ games.
What was most interesting was that social networking apps open to children aged 4+ were rated very low. These ratings could come from parents frustrated that their child is messaging someone online. Companies could consider setting age restrictions to keep younger children off these social networking apps, which might improve their ratings.
Cleaning and separating data
## # A tibble: 7,488 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 2.99 4 Price Average_User_Rating
## 2 1.99 3.5 Price Average_User_Rating
## 3 0 3 Price Average_User_Rating
## 4 0 3.5 Price Average_User_Rating
## 5 2.99 3.5 Price Average_User_Rating
## 6 0 3 Price Average_User_Rating
## 7 0 2.5 Price Average_User_Rating
## 8 0.99 2.5 Price Average_User_Rating
## 9 0 2.5 Price Average_User_Rating
## 10 0 2.5 Price Average_User_Rating
## # ... with 7,478 more rows
library(ggplot2)
library(gridExtra) # for the grid.arrange() function

# Bar chart of app prices, zoomed to the $0-$20 range
G1 <- ggplot(data = clean_data) +
  geom_bar(mapping = aes(x = Price)) +
  coord_cartesian(xlim = c(0, 20)) +
  labs(title = "Overall Price",
       x = "Prices (excluding prices over $20)")

# Bar chart of average user ratings (AUR)
G2 <- ggplot(data = clean_data) +
  geom_bar(mapping = aes(x = AUR)) +
  coord_cartesian(xlim = c(0, 5)) +
  labs(title = "Overall Average User Rating",
       x = "Average User Rating")

# Show the two charts side by side
grid.arrange(G1, G2, ncol = 2)
These charts show the distributions of prices and ratings. One of the most important factors people look at is money: a free game is likely to have more downloads and users than a game with an upfront cost. Unsurprisingly, free games vastly outnumber games that require an upfront payment. This also raises questions about the quality of a free game versus a paid one. Some might expect a paid game to be “better in quality” than a free one, since the cost of entry is higher. The overall average user rating chart shows that 4.5 is the most common rating across all price points combined.
## # A tibble: 123 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 5.99 4 Price Average_User_Rating
## 2 7.99 4 Price Average_User_Rating
## 3 7.99 4 Price Average_User_Rating
## 4 5.99 2.5 Price Average_User_Rating
## 5 9.99 4 Price Average_User_Rating
## 6 9.99 5 Price Average_User_Rating
## 7 7.99 4 Price Average_User_Rating
## 8 5.99 3 Price Average_User_Rating
## 9 5.99 4.5 Price Average_User_Rating
## 10 9.99 3.5 Price Average_User_Rating
## # ... with 113 more rows
## [[1]]
## # A tibble: 6,269 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 0 3 Price Average_User_Rating
## 2 0 3.5 Price Average_User_Rating
## 3 0 3 Price Average_User_Rating
## 4 0 2.5 Price Average_User_Rating
## 5 0 2.5 Price Average_User_Rating
## 6 0 2.5 Price Average_User_Rating
## 7 0 3.5 Price Average_User_Rating
## 8 0 3 Price Average_User_Rating
## 9 0 2.5 Price Average_User_Rating
## 10 0 3 Price Average_User_Rating
## # ... with 6,259 more rows
##
## [[2]]
## # A tibble: 348 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 0.99 2.5 Price Average_User_Rating
## 2 0.99 3.5 Price Average_User_Rating
## 3 0.99 3 Price Average_User_Rating
## 4 0.99 2 Price Average_User_Rating
## 5 0.99 4 Price Average_User_Rating
## 6 0.99 2.5 Price Average_User_Rating
## 7 0.99 3.5 Price Average_User_Rating
## 8 0.99 3.5 Price Average_User_Rating
## 9 0.99 3 Price Average_User_Rating
## 10 0.99 3 Price Average_User_Rating
## # ... with 338 more rows
##
## [[3]]
## # A tibble: 446 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 2.99 4 Price Average_User_Rating
## 2 1.99 3.5 Price Average_User_Rating
## 3 2.99 3.5 Price Average_User_Rating
## 4 2.99 4 Price Average_User_Rating
## 5 2.99 2.5 Price Average_User_Rating
## 6 2.99 4 Price Average_User_Rating
## 7 2.99 3.5 Price Average_User_Rating
## 8 1.99 4 Price Average_User_Rating
## 9 2.99 4 Price Average_User_Rating
## 10 2.99 3 Price Average_User_Rating
## # ... with 436 more rows
##
## [[4]]
## # A tibble: 285 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 4.99 4 Price Average_User_Rating
## 2 4.99 4 Price Average_User_Rating
## 3 3.99 3 Price Average_User_Rating
## 4 4.99 3.5 Price Average_User_Rating
## 5 3.99 4.5 Price Average_User_Rating
## 6 4.99 4.5 Price Average_User_Rating
## 7 3.99 3.5 Price Average_User_Rating
## 8 3.99 3.5 Price Average_User_Rating
## 9 4.99 4 Price Average_User_Rating
## 10 4.99 4 Price Average_User_Rating
## # ... with 275 more rows
##
## [[5]]
## # A tibble: 123 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 5.99 4 Price Average_User_Rating
## 2 7.99 4 Price Average_User_Rating
## 3 7.99 4 Price Average_User_Rating
## 4 5.99 2.5 Price Average_User_Rating
## 5 9.99 4 Price Average_User_Rating
## 6 9.99 5 Price Average_User_Rating
## 7 7.99 4 Price Average_User_Rating
## 8 5.99 3 Price Average_User_Rating
## 9 5.99 4.5 Price Average_User_Rating
## 10 9.99 3.5 Price Average_User_Rating
## # ... with 113 more rows
##
## [[6]]
## # A tibble: 17 x 4
## # Groups: "Price", "Average_User_Rating" [1]
## Price AUR `"Price"` `"Average_User_Rating"`
## <dbl> <dbl> <chr> <chr>
## 1 20.0 4.5 Price Average_User_Rating
## 2 12.0 3.5 Price Average_User_Rating
## 3 12.0 4.5 Price Average_User_Rating
## 4 140. 4.5 Price Average_User_Rating
## 5 20.0 4.5 Price Average_User_Rating
## 6 13.0 4 Price Average_User_Rating
## 7 20.0 3.5 Price Average_User_Rating
## 8 20.0 4 Price Average_User_Rating
## 9 20.0 4 Price Average_User_Rating
## 10 15.0 3.5 Price Average_User_Rating
## 11 13.0 4 Price Average_User_Rating
## 12 15.0 4 Price Average_User_Rating
## 13 17.0 4 Price Average_User_Rating
## 14 13.0 3 Price Average_User_Rating
## 15 12.0 4.5 Price Average_User_Rating
## 16 37.0 4 Price Average_User_Rating
## 17 60.0 4 Price Average_User_Rating
Filtering Games by prices
We plotted the average ratings for apps at each price point, using separate charts to see whether the ratings differ. As you can see, the sample of free apps is far larger than all the other price points combined. This is not much of a surprise, since free games have a much lower barrier to entry, which leads to more players trying them. More surprising is that ratings were relatively constant across all price points. The price point with the lowest ratings overall also seems to be the most expensive one. This raises the question of whether the expected quality of a product only rises once an app or game reaches a certain price point.
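The price-point grouping can be sketched with `cut()`. The break points below are an assumption inferred from the six groups printed above (free, up to $1, $1 to $3, $3 to $5, $5 to $10, over $10), and the sample values are hypothetical.

```r
library(dplyr)

# Hypothetical sample of (Price, AUR) pairs.
clean_data <- tibble(
  Price = c(0, 0, 0.99, 2.99, 4.99, 9.99, 19.99),
  AUR   = c(3, 4.5, 2.5, 4, 4, 5, 4.5)
)

price_summary <- clean_data %>%
  mutate(bucket = cut(
    Price,
    breaks = c(-Inf, 0, 1, 3, 5, 10, Inf),  # assumed cut points, right-closed
    labels = c("Free", "Up to $1", "$1-$3", "$3-$5", "$5-$10", "Over $10")
  )) %>%
  group_by(bucket) %>%
  summarise(n = n(), mean_AUR = mean(AUR), .groups = "drop")

print(price_summary)
```

Reporting `n` next to the mean rating keeps the very different sample sizes of the price points visible when the averages are compared.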
After taking a glance at the initial purchase price of apps, we then explored in-app purchase prices.
To find the average price of in-app purchases, the dataset was filtered to include only the name, price, average user rating, and in-app purchase prices. Since the in-app purchases column held multiple prices per game separated by commas, we used the separate_rows() function to split each individual in-app purchase price onto a new row and converted all values to numeric. From there, the summarise() function was used to find the average in-app purchase price.
AUR = Average User Rating; InApp = in-app purchases
## # A tibble: 1 x 3
## avgPrice avgRating avgInApp
## <dbl> <dbl> <dbl>
## 1 0.321 4.19 11.4
The average price of in-app purchases is approximately $11.40.
Then, after calculating the average in-app purchase price, we examined if a relationship existed between user ratings and in-app purchases.
To see this visually, we plotted the average user rating against the in-app purchase price to look for any trend, and we also checked the correlation between the two variables.
## [1] -0.01262201
The negative sign of the correlation between Average User Rating and In-App Purchases would suggest that as the cost of in-app purchases increases, the average user rating decreases.
However, a correlation coefficient of about -0.01 is so close to 0 that, although its sign is negative, any linear relationship between in-app purchase price and average user rating is negligible.
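To put the size of this coefficient in perspective, squaring it gives the share of variance a straight line would explain: about 0.00016, or under 0.02%. The sketch below computes that value from the reported coefficient and also shows, on hypothetical paired data, how `cor()` and `cor.test()` could be used to check a correlation and its significance.

```r
# Correlation reported in the output above.
reported_r  <- -0.01262201
reported_r2 <- reported_r^2      # share of variance explained by a linear fit
print(reported_r2)

# Hypothetical paired observations: in-app price and the app's average rating.
inapp  <- c(0.99, 1.99, 4.99, 9.99, 19.99, 49.99, 99.99)
rating <- c(4.5, 3.5, 4.0, 4.5, 3.0, 4.5, 4.0)

r    <- cor(inapp, rating)       # Pearson correlation coefficient
test <- cor.test(inapp, rating)  # tests H0: the true correlation is 0
print(r)
print(test$p.value)
```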
## Selecting by User Rating Count
## # A tibble: 5 x 4
## Name Developer `Average User Rat~ `User Rating Cou~
## <chr> <chr> <dbl> <dbl>
## 1 "Clash of Clans" Supercell 4.5 3032734
## 2 "Clash Royale" Supercell 4.5 1277095
## 3 "PUBG MOBILE" Tencent Mobile Interna~ 4.5 711409
## 4 "Plants vs. Zomb~ PopCap 4.5 469562
## 5 "Pokémon GO" Niantic, Inc. 3.5 439776
For this graph, the popularity of a game is measured by User Rating Count rather than Average User Rating. Average User Rating is not a good measure of popularity because a game can have a very high rating but very few ratings. A surprising finding was that the two most popular games were both created by the same developer, Supercell. Four of the five games shown also have a high average user rating of 4.5.
Looking at the graph, a large majority of the games have an age rating of 4+. A likely reason is that a lower age rating lets a game reach more users.
## `summarise()` has grouped output by 'date'. You can override using the `.groups` argument.
Next we asked how the size of the applications in the top 3 primary genres changed over a span of about 11 years. As we can see, application size increased considerably, now averaging about 3 × 10^8 bytes (roughly 300 MB).
This makes sense as applications become more complex, with more lines of code, more features, higher-resolution images, and 3D models with more polygons, all of which increase the size of the application.
There are many ways to measure the success of a game. With the dataset we have, we decided that Average User Rating would be a good measure of that success.
We chose four variables as predictors, since they seem likely to play a part in a game’s rating: Age Rating, Price, Size, and User Rating Count. Age Rating focuses a game on a specific age group, which might give it a better chance of a good rating; expectations differ by age group, and some may be easier to satisfy than others. Price sets an expectation of how good the game should be, since users make an initial “investment” before actually playing. Size might make a game more appealing, since a larger game may have more features and be more refined than a much smaller one. User Rating Count shows how active the game is and gives the sample size behind each rating; a larger sample gives more confidence that the game is entertaining for a large group of people.
Null hypothesis: $H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$. There is no relationship between $X_1, X_2, \ldots, X_p$ and $Y$ at all.
Alternative hypothesis: $H_a$: at least one $\beta_j \neq 0$. There is some relationship between $X_j$ and $Y$.
This hypothesis test checks whether the predictors have a relationship with our Y, which in this case is Average User Rating. To carry it out, we calculate the p-value for the predictors in relation to Y.
##
## Call:
## lm(formula = `Average User Rating` ~ Price + `User Rating Count` +
## `Age Rating` + Size, data = clean_games)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1101 -0.5272 0.2995 0.4624 1.0804
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.073e+00 2.295e-02 177.502 < 2e-16 ***
## Price -3.729e-03 3.623e-03 -1.029 0.30336
## `User Rating Count` 5.267e-07 2.038e-07 2.584 0.00978 **
## `Age Rating`17+ -1.518e-01 4.884e-02 -3.108 0.00189 **
## `Age Rating`4+ -4.668e-02 2.439e-02 -1.914 0.05568 .
## `Age Rating`9+ -1.075e-02 2.863e-02 -0.375 0.70735
## Size 1.650e-10 3.572e-11 4.619 3.92e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7484 on 7481 degrees of freedom
## Multiple R-squared: 0.006409, Adjusted R-squared: 0.005612
## F-statistic: 8.042 on 6 and 7481 DF, p-value: 1.123e-08
## value numdf dendf
## 8.042076 6.000000 7481.000000
We ran a multiple linear regression and obtained some important information. First, the overall p-value (1.123e-08) is very low, so we can reject the null hypothesis and conclude that there is some relationship between Y and at least one of the predictors. The F-statistic of 8.04 is well above 1, which points the same way. However, the adjusted R² is very low (0.0056), which indicates that the predictors explain less than 1% of the variation in the dependent variable. The residual standard error (RSE) measures lack of fit; at about 0.75 rating points on a 5-point rating scale it is not small, so despite the statistically significant relationship the model does not fit the data closely. Lastly, we pick the best predictor of the bunch. Judging by the p-values and absolute t-values, Size is the best predictor of Average User Rating, since it has by far the lowest p-value (3.92e-06) and the highest absolute t-value (4.619).
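The overall p-value follows directly from the F statistic and its degrees of freedom, which are the three numbers printed under `value numdf dendf` above. The sketch below recomputes it from those reported values and then shows the same extraction on a fitted model. The built-in `mtcars` data is used only as a stand-in, since the games data is not reproduced here.

```r
# F statistic and degrees of freedom exactly as reported in the output above.
f_value <- 8.042076
numdf   <- 6
dendf   <- 7481

# The upper-tail probability of the F distribution gives the overall p-value.
p_value <- pf(f_value, numdf, dendf, lower.tail = FALSE)
print(p_value)

# The same three numbers can be pulled from any fitted lm object;
# mtcars is a built-in stand-in for the games data.
fit   <- lm(mpg ~ wt + hp, data = mtcars)
f     <- summary(fit)$fstatistic  # named vector: value, numdf, dendf
p_fit <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(p_fit)
```

A small overall p-value only says that at least one coefficient is unlikely to be zero; it says nothing about how much variation the model explains, which is what R² and the RSE describe.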
## `geom_smooth()` using formula 'y ~ x'
We plotted each predictor against Average User Rating. Age Rating, being categorical, does not show a linear regression line, and the line for Price is essentially flat. Both Size and User Rating Count, however, show a positive linear trend.
For the next model, we wanted to predict whether an application is free, using multiple logistic regression.
To start, we had to do some initial cleaning. The in-app purchases column contained a single string listing every purchase price for the app. We separated these into one row per price and converted them to doubles, which let us summarise the total price of in-app purchases (sum.iap), their count (count.iap), and their average (avg.iap) for each app. We also grouped the average in-app purchase price into classes (iap.class), which proved useful in later models.
## # A tibble: 6 x 14
## Name `Average User Ra~ `User Rating Co~ Price `In-app Purchas~ Description
## <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 Sudoku 4 3553 2.99 <NA> "Join over 2~
## 2 Rever~ 3.5 284 1.99 <NA> "The classic~
## 3 Moroc~ 3 8376 0 <NA> "Play the cl~
## 4 Sudok~ 3.5 190394 0 <NA> "Top 100 fre~
## 5 Senet~ 3.5 28 2.99 <NA> "\"Senet Del~
## 6 Sudok~ 3 47 0 1.99 "Sudoku will~
## # ... with 8 more variables: Developer <chr>, Age Rating <chr>,
## # Languages <chr>, Size <dbl>, Primary Genre <chr>, Genres <chr>,
## # Original Release Date <date>, Current Version Release Date <date>
## # A tibble: 7,488 x 4
## Name sum.iap count.iap avg.iap
## <chr> <dbl> <int> <dbl>
## 1 "Bungee Stickmen - Australian Landmarks {LITE +}" 239. 3 79.7
## 2 "Arcane Pets: Plushie Empire" 300. 4 75.0
## 3 "War of Nations\\u2122 - PVP Strategy" 675. 10 67.5
## 4 "War Planet Online" 665. 10 66.5
## 5 "Final Fantasy XV: A New Empire" 655. 10 65.5
## 6 "My Math Elementary Kids Games" 174. 3 58.0
## 7 "Imperial Ambition" 551. 10 55.1
## 8 "Idle Crypto Tycoon" 103. 2 51.5
## 9 "World War Rising" 515. 10 51.5
## 10 "Clash of Queens: Light or Dark" 411. 8 51.4
## # ... with 7,478 more rows
## Warning: Unknown or uninitialised column: `iap.class`.
## # A tibble: 6 x 5
## Name sum.iap count.iap avg.iap iap.class
## <chr> <dbl> <int> <dbl> <chr>
## 1 "- Turning -" 3.98 2 1.99 $0.01-$10.00
## 2 "! Chess !" 0 1 0 $0
## 3 "\"100 Years' War\"" 3.98 2 1.99 $0.01-$10.00
## 4 "\"3D Rubik's Cube : Rubik Solver\"" 0 1 0 $0
## 5 "\"3x3 Rubik's Cube Solver\"" 0 1 0 $0
## 6 "\"9 Men's Morris\"" 0.99 1 0.99 $0.01-$10.00
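The in-app purchase cleaning described above can be sketched in base R as follows. The toy data frame, its column name `iap`, and the prices are made up for illustration. The class break points are inferred from the iap.class labels in the output, and treating a missing value as a single 0 entry is an assumption based on the count.iap = 1 rows shown for apps with no purchases.

```r
# Toy stand-in for the cleaned data; names and prices are hypothetical.
games <- data.frame(
  Name = c("App A", "App B", "App C"),
  iap  = c("1.99, 4.99, 9.99", NA, "49.99, 99.99"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into numeric prices.
# Apps with no in-app purchases (NA) become a single 0 entry (assumption).
prices <- lapply(strsplit(games$iap, ",\\s*"), function(x) {
  if (length(x) == 1 && is.na(x)) 0 else as.numeric(x)
})

# Per-app total, count, and average.
games$sum.iap   <- vapply(prices, sum, numeric(1))
games$count.iap <- vapply(prices, length, integer(1))
games$avg.iap   <- games$sum.iap / games$count.iap

# Bin the average price; intervals are closed on the right, e.g. (0, 10].
games$iap.class <- as.character(cut(
  games$avg.iap,
  breaks = c(-Inf, 0, 10, 20, 30, 40, 80),
  labels = c("$0", "$0.01-$10.00", "$10.01-$20.00", "$20.01-$30.00",
             "$30.01-$40.00", "$40.01-$80.00")
))

print(games[, c("Name", "sum.iap", "count.iap", "avg.iap", "iap.class")])
```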
Next, the data include date columns for the day each app was released and the day it was last updated, but raw dates are awkward to use in a model. Instead, we computed the number of days since release and the number of days since the last update by subtracting each of those dates from the date the data was scraped (2019-08-03).
Because we wanted to predict whether an application is free, we used an if-else statement to assign a 1 if the app was free and a 0 if it wasn’t.
## # A tibble: 6 x 6
## `Original Release ~ `Current Version Rel~ sum.iap count.iap avg.iap iap.class
## <date> <date> <dbl> <int> <dbl> <chr>
## 1 2008-07-11 2017-05-30 0 1 0 $0
## 2 2008-07-11 2018-05-17 0 1 0 $0
## 3 2008-07-11 2017-09-05 0 1 0 $0
## 4 2008-07-23 2017-05-30 0 1 0 $0
## 5 2008-07-18 2018-07-22 0 1 0 $0
## 6 2008-07-30 2019-04-29 1.99 1 1.99 $0.01-$10~
## # A tibble: 7,464 x 6
## Name Price Today days.since.relea~ days.since.last.u~ free
## <chr> <dbl> <date> <dbl> <dbl> <dbl>
## 1 "Sudoku" 2.99 2019-08-03 4040 795 0
## 2 "Reversi" 1.99 2019-08-03 4040 443 0
## 3 "Morocco" 0 2019-08-03 4040 697 1
## 4 "Sudoku (Free)" 0 2019-08-03 4028 795 1
## 5 "Senet Deluxe" 2.99 2019-08-03 4033 377 0
## 6 "Sudoku - Classi~ 0 2019-08-03 4021 96 1
## 7 "Colony" 0.99 2019-08-03 4017 304 0
## 8 "Carte" 0 2019-08-03 4017 618 1
## 9 "\"Barrels O' Fu~ 0 2019-08-03 4019 4019 1
## 10 "Lumen Lite" 0 2019-08-03 4002 3906 1
## # ... with 7,454 more rows
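A minimal sketch of the two derived variables, using the dates and prices of Sudoku and Morocco from the tibbles above. The column names `released` and `updated` are shortened stand-ins for the original date columns, and `ifelse()` is one way to write the if-else step, not necessarily the one used in the report.

```r
# Date the data was scraped, as shown in the Today column.
scrape_date <- as.Date("2019-08-03")

# Two apps from the output above; column names shortened for the sketch.
apps <- data.frame(
  Name     = c("Sudoku", "Morocco"),
  Price    = c(2.99, 0),
  released = as.Date(c("2008-07-11", "2008-07-11")),
  updated  = as.Date(c("2017-05-30", "2017-09-05"))
)

# Subtracting two Date objects gives a difference in days.
apps$days.since.release     <- as.numeric(scrape_date - apps$released)
apps$days.since.last.update <- as.numeric(scrape_date - apps$updated)

# 1 if the app is free, 0 otherwise.
apps$free <- ifelse(apps$Price == 0, 1, 0)

print(apps)
```

For these two apps the sketch gives 4040 days since release and 795 and 697 days since the last update, the same values shown in the output above.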
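The last cleaning step before modeling was to remove unnecessary variables and to split the Languages and Genres variables on their delimiters, giving one row per combination of app, language, and genre. The first tibble below shows the data before the split; the second and third show it after.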
## # A tibble: 7,488 x 3
## Name Languages Genres
## <chr> <chr> <chr>
## 1 "Sudoku" DA, NL, EN, FI, FR, DE, IT, JA, KO,~ Games, Strategy, Puz~
## 2 "Reversi" EN Games, Strategy, Boa~
## 3 "Morocco" EN Games, Board, Strate~
## 4 "Sudoku (Free)" DA, NL, EN, FI, FR, DE, IT, JA, KO,~ Games, Strategy, Puz~
## 5 "Senet Deluxe" DA, NL, EN, FR, DE, EL, IT, JA, KO,~ Games, Strategy, Boa~
## 6 "Sudoku - Classic~ EN Games, Entertainment~
## 7 "Gravitation" <NA> Games, Entertainment~
## 8 "Colony" EN Games, Strategy, Boa~
## 9 "Carte" FR Games, Strategy, Boa~
## 10 "\"Barrels O' Fun~ EN Games, Casual, Strat~
## # ... with 7,478 more rows
## # A tibble: 101,229 x 3
## Name Languages Genres
## <chr> <chr> <chr>
## 1 Sudoku DA Games
## 2 Sudoku DA Strategy
## 3 Sudoku DA Puzzle
## 4 Sudoku NL Games
## 5 Sudoku NL Strategy
## 6 Sudoku NL Puzzle
## 7 Sudoku EN Games
## 8 Sudoku EN Strategy
## 9 Sudoku EN Puzzle
## 10 Sudoku FI Games
## # ... with 101,219 more rows
## # A tibble: 101,229 x 13
## `Average User Rating` `User Rating Count` `Age Rating` Languages Size
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 4 3553 4+ DA 15853568
## 2 4 3553 4+ DA 15853568
## 3 4 3553 4+ DA 15853568
## 4 4 3553 4+ NL 15853568
## 5 4 3553 4+ NL 15853568
## 6 4 3553 4+ NL 15853568
## 7 4 3553 4+ EN 15853568
## 8 4 3553 4+ EN 15853568
## 9 4 3553 4+ EN 15853568
## 10 4 3553 4+ FI 15853568
## # ... with 101,219 more rows, and 8 more variables: Primary Genre <chr>,
## # Genres <chr>, sum.iap <dbl>, count.iap <int>, iap.class <chr>,
## # days.since.release <dbl>, days.since.last.update <dbl>, free <dbl>
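The row expansion can be sketched in base R as below. The helper `split_rows()` is hypothetical and written for this illustration. It splits on a comma followed by optional spaces, which is an assumption about the delimiter; splitting on the comma explicitly keeps multi-word genres such as "Role Playing" as a single value.

```r
# Two apps with abbreviated language lists, for illustration only.
apps <- data.frame(
  Name      = c("Sudoku", "Reversi"),
  Languages = c("DA, NL, EN", "EN"),
  Genres    = c("Games, Strategy, Puzzle", "Games, Strategy, Board"),
  stringsAsFactors = FALSE
)

# Turn one delimited column into one row per value,
# repeating the other columns for each value.
split_rows <- function(df, col, sep = ",\\s*") {
  parts <- strsplit(df[[col]], sep)
  out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
  out[[col]] <- unlist(parts)
  rownames(out) <- NULL
  out
}

# Expand languages first, then genres: every language is paired with every genre.
long <- split_rows(split_rows(apps, "Languages"), "Genres")
print(long)
```

Each app contributes (number of languages) x (number of genres) rows, which is why 7,488 apps grow to 101,229 rows in the output above.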
The first model we created was a full-width base model, meaning that it uses only the variables originally provided and none of the new variables we made. The base predictors were average user rating, user rating count, age rating, languages, primary genre, sub-genre (Genres), and the size of the app. Because this is a logistic regression, we used the glm function with family set to binomial to predict whether the app was free.
We also wrote a function that computes the misclassification error, so that the calculation would not be repeated for each model.
## # A tibble: 101,229 x 8
## `Average User Rating` `User Rating Count` `Age Rating` Languages Size
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 4 3553 4+ DA 15853568
## 2 4 3553 4+ DA 15853568
## 3 4 3553 4+ DA 15853568
## 4 4 3553 4+ NL 15853568
## 5 4 3553 4+ NL 15853568
## 6 4 3553 4+ NL 15853568
## 7 4 3553 4+ EN 15853568
## 8 4 3553 4+ EN 15853568
## 9 4 3553 4+ EN 15853568
## 10 4 3553 4+ FI 15853568
## # ... with 101,219 more rows, and 3 more variables: Primary Genre <chr>,
## # Genres <chr>, free <dbl>
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = free ~ ., family = binomial(), data = logit.data.orig)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2428 0.3306 0.5118 0.5927 2.1978
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.645e+01 3.603e+02 0.046 0.963577
## `Average User Rating` 1.006e-01 1.326e-02 7.588 3.24e-14 ***
## `User Rating Count` 1.354e-05 9.251e-07 14.631 < 2e-16 ***
## `Age Rating`17+ 3.639e-01 6.694e-02 5.437 5.43e-08 ***
## `Age Rating`4+ -2.647e-01 2.750e-02 -9.624 < 2e-16 ***
## `Age Rating`9+ -4.281e-01 2.961e-02 -14.457 < 2e-16 ***
## LanguagesAM 1.503e+01 9.037e+02 0.017 0.986727
## LanguagesAR 1.599e+00 6.574e-01 2.432 0.015035 *
## LanguagesAS 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesAY 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesAZ 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesBE 1.468e+01 5.975e+02 0.025 0.980403
## LanguagesBG 2.117e+00 8.163e-01 2.594 0.009499 **
## LanguagesBN 1.492e+01 2.135e+02 0.070 0.944299
## LanguagesBO 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesBR 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesBS 1.459e+01 5.405e+02 0.027 0.978456
## LanguagesCA 1.179e+00 6.596e-01 1.788 0.073781 .
## LanguagesCS 9.104e-01 6.478e-01 1.405 0.159887
## LanguagesCY 1.507e+01 8.991e+02 0.017 0.986632
## LanguagesDA 6.123e-01 6.458e-01 0.948 0.343076
## LanguagesDE 5.381e-03 6.404e-01 0.008 0.993296
## LanguagesDZ 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesEL 9.919e-01 6.499e-01 1.526 0.126944
## LanguagesEN 1.708e-01 6.397e-01 0.267 0.789466
## LanguagesEO 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesES 1.004e-01 6.405e-01 0.157 0.875451
## LanguagesET 1.459e+01 4.283e+02 0.034 0.972816
## LanguagesEU 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesFA 1.124e+00 7.063e-01 1.592 0.111482
## LanguagesFI 6.498e-01 6.472e-01 1.004 0.315407
## LanguagesFO 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesFR 1.507e-02 6.404e-01 0.024 0.981220
## LanguagesGA 1.384e+01 6.201e+02 0.022 0.982188
## LanguagesGD 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesGL 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesGN 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesGU 1.495e+01 2.341e+02 0.064 0.949070
## LanguagesGV 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesHE 9.322e-01 6.523e-01 1.429 0.152968
## LanguagesHI 1.351e+00 6.842e-01 1.975 0.048253 *
## LanguagesHR 1.589e+00 7.102e-01 2.237 0.025272 *
## LanguagesHU 1.006e+00 6.528e-01 1.541 0.123335
## LanguagesHY 1.500e+01 3.408e+02 0.044 0.964892
## LanguagesID 1.380e+00 6.506e-01 2.121 0.033901 *
## LanguagesIS 1.482e+01 5.969e+02 0.025 0.980189
## LanguagesIT 2.075e-01 6.409e-01 0.324 0.746090
## LanguagesIU 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesJA 2.869e-01 6.407e-01 0.448 0.654329
## LanguagesJV 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesKA 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesKK 1.434e+01 8.010e+02 0.018 0.985716
## LanguagesKL 1.518e+01 8.363e+02 0.018 0.985515
## LanguagesKM 1.493e+01 7.552e+02 0.020 0.984231
## LanguagesKN 1.492e+01 2.256e+02 0.066 0.947279
## LanguagesKO 2.698e-01 6.410e-01 0.421 0.673797
## LanguagesKR 1.520e+01 8.441e+02 0.018 0.985629
## LanguagesKS 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesKU 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesKY 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesLA 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesLO 1.493e+01 7.552e+02 0.020 0.984231
## LanguagesLT 8.708e-01 8.797e-01 0.990 0.322234
## LanguagesLV 1.491e+01 2.243e+02 0.066 0.946991
## LanguagesMG 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesMK 1.473e+01 5.078e+02 0.029 0.976857
## LanguagesML 1.493e+01 2.448e+02 0.061 0.951375
## LanguagesMN 1.503e+01 9.037e+02 0.017 0.986727
## LanguagesMR 1.496e+01 2.308e+02 0.065 0.948329
## LanguagesMS 1.308e+00 6.563e-01 1.993 0.046211 *
## LanguagesMT 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesMY 1.451e+01 6.082e+02 0.024 0.980968
## LanguagesNB 6.055e-01 6.467e-01 0.936 0.349101
## LanguagesNE 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesNL 3.977e-01 6.425e-01 0.619 0.535925
## LanguagesNN 5.536e-01 8.289e-01 0.668 0.504231
## LanguagesNO 2.245e-01 6.849e-01 0.328 0.743026
## LanguagesOM 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesOR 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesPA 1.494e+01 2.498e+02 0.060 0.952314
## LanguagesPL 3.334e-01 6.427e-01 0.519 0.603903
## LanguagesPS 1.434e+01 8.010e+02 0.018 0.985716
## LanguagesPT 3.269e-01 6.411e-01 0.510 0.610111
## LanguagesQU 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesRN 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesRO 1.760e+00 6.679e-01 2.636 0.008398 **
## LanguagesRU 2.485e-01 6.408e-01 0.388 0.698193
## LanguagesRW 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesSA 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesSD 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesSE 7.954e-02 8.434e-01 0.094 0.924859
## LanguagesSI 1.474e+01 6.183e+02 0.024 0.980983
## LanguagesSK 9.631e-01 6.538e-01 1.473 0.140702
## LanguagesSL 1.887e+00 8.174e-01 2.309 0.020963 *
## LanguagesSO 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesSQ 1.481e+01 4.669e+02 0.032 0.974685
## LanguagesSR 9.674e-01 8.244e-01 1.173 0.240613
## LanguagesSU 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesSV 3.141e-01 6.431e-01 0.488 0.625204
## LanguagesSW 1.460e+01 6.918e+02 0.021 0.983161
## LanguagesTA 1.492e+01 2.256e+02 0.066 0.947279
## LanguagesTE 1.489e+01 2.387e+02 0.062 0.950264
## LanguagesTG 1.434e+01 8.010e+02 0.018 0.985716
## LanguagesTH 1.073e+00 6.473e-01 1.658 0.097239 .
## LanguagesTI 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesTK 1.434e+01 8.010e+02 0.018 0.985716
## LanguagesTL 1.519e+01 4.676e+02 0.032 0.974082
## LanguagesTO 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesTR 7.150e-01 6.431e-01 1.112 0.266236
## LanguagesTT 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesUG 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesUK 1.363e+00 6.624e-01 2.057 0.039692 *
## LanguagesUR 1.480e+01 3.942e+02 0.038 0.970056
## LanguagesUZ 1.434e+01 8.010e+02 0.018 0.985716
## LanguagesVI 1.248e+00 6.510e-01 1.916 0.055333 .
## LanguagesYI 1.496e+01 1.196e+03 0.013 0.990017
## LanguagesZH 3.911e-01 6.406e-01 0.610 0.541578
## LanguagesZU 1.513e+01 1.385e+03 0.011 0.991280
## Size -8.699e-10 2.910e-11 -29.891 < 2e-16 ***
## `Primary Genre`Business -1.854e+01 3.603e+02 -0.051 0.958963
## `Primary Genre`Education -1.652e+01 3.603e+02 -0.046 0.963418
## `Primary Genre`Entertainment -1.358e+01 3.603e+02 -0.038 0.969933
## `Primary Genre`Finance -1.410e+01 3.603e+02 -0.039 0.968781
## `Primary Genre`Food & Drink -4.442e-01 1.092e+03 0.000 0.999675
## `Primary Genre`Games -1.480e+01 3.603e+02 -0.041 0.967231
## `Primary Genre`Health & Fitness -1.709e+01 3.603e+02 -0.047 0.962156
## `Primary Genre`Lifestyle -1.728e+01 3.603e+02 -0.048 0.961743
## `Primary Genre`Medical 1.065e+00 5.470e+02 0.002 0.998447
## `Primary Genre`Music -9.439e-01 4.132e+02 -0.002 0.998177
## `Primary Genre`Navigation -7.060e-01 1.172e+03 -0.001 0.999519
## `Primary Genre`News -2.954e-01 9.208e+02 0.000 0.999744
## `Primary Genre`Productivity -1.835e+01 3.603e+02 -0.051 0.959381
## `Primary Genre`Reference -1.538e+01 3.603e+02 -0.043 0.965956
## `Primary Genre`Shopping 9.306e-02 1.431e+03 0.000 0.999948
## `Primary Genre`Social Networking -9.199e-01 4.626e+02 -0.002 0.998413
## `Primary Genre`Sports -1.429e+01 3.603e+02 -0.040 0.968367
## `Primary Genre`Stickers -1.590e+01 3.603e+02 -0.044 0.964792
## `Primary Genre`Travel -9.081e-02 1.252e+03 0.000 0.999942
## `Primary Genre`Utilities -1.614e+01 3.603e+02 -0.045 0.964259
## GenresAdventure -2.582e-01 9.506e-02 -2.716 0.006603 **
## GenresBoard -1.018e+00 6.734e-02 -15.110 < 2e-16 ***
## GenresBooks -2.643e-01 6.422e-01 -0.412 0.680648
## GenresBusiness -4.916e-01 3.436e-01 -1.431 0.152423
## GenresCard -1.975e-01 1.007e-01 -1.962 0.049765 *
## GenresCasino 1.458e+00 7.191e-01 2.028 0.042591 *
## GenresCasual 4.888e-01 1.143e-01 4.277 1.89e-05 ***
## GenresDrink 1.558e+00 7.208e-01 2.161 0.030694 *
## GenresEducation -9.034e-01 1.020e-01 -8.858 < 2e-16 ***
## GenresEmoji -1.013e+00 1.579e+00 -0.641 0.521225
## GenresEntertainment -1.628e-01 5.835e-02 -2.789 0.005279 **
## GenresExpressions -1.013e+00 1.579e+00 -0.641 0.521225
## GenresFamily 5.840e-01 1.285e-01 4.546 5.46e-06 ***
## GenresFinance 1.100e+00 7.300e-01 1.507 0.131758
## GenresFitness 1.757e+00 1.086e+00 1.618 0.105558
## GenresFood 1.558e+00 7.208e-01 2.161 0.030694 *
## GenresGames -2.384e-01 5.486e-02 -4.345 1.39e-05 ***
## GenresGaming -2.126e-01 1.413e+00 -0.150 0.880401
## GenresHealth 1.757e+00 1.086e+00 1.618 0.105558
## GenresKids 1.586e+01 2.400e+03 0.007 0.994726
## GenresLifestyle 1.025e+00 2.653e-01 3.864 0.000112 ***
## GenresMagazines 1.454e+01 2.400e+03 0.006 0.995166
## GenresMedical -2.786e+00 1.226e+00 -2.272 0.023089 *
## GenresMusic 1.704e+00 5.088e-01 3.349 0.000810 ***
## GenresNavigation 1.408e+01 9.375e+02 0.015 0.988018
## GenresNetworking 1.267e+00 3.266e-01 3.879 0.000105 ***
## GenresNews -1.993e-01 1.084e+00 -0.184 0.854106
## GenresNewspapers 1.454e+01 2.400e+03 0.006 0.995166
## GenresPhoto 1.470e+01 7.944e+02 0.019 0.985239
## GenresPlaying 1.615e-01 7.567e-02 2.134 0.032827 *
## GenresProductivity 2.052e-01 4.647e-01 0.442 0.658730
## GenresPuzzle -3.900e-01 6.786e-02 -5.747 9.11e-09 ***
## GenresRacing 5.463e-01 3.559e-01 1.535 0.124766
## GenresReference -9.045e-01 2.845e-01 -3.180 0.001473 **
## GenresRole 1.615e-01 7.567e-02 2.134 0.032827 *
## GenresShopping -2.904e-01 2.770e+03 0.000 0.999916
## GenresSimulation -4.561e-01 6.540e-02 -6.974 3.08e-12 ***
## GenresSocial 1.267e+00 3.266e-01 3.879 0.000105 ***
## GenresSports 8.742e-01 1.632e-01 5.358 8.40e-08 ***
## GenresStickers -2.126e-01 1.413e+00 -0.150 0.880401
## GenresStrategy -2.384e-01 5.486e-02 -4.346 1.39e-05 ***
## GenresTravel -2.837e-01 4.499e-01 -0.631 0.528264
## GenresTrivia 1.402e+00 3.329e-01 4.210 2.56e-05 ***
## GenresUtilities 2.995e-03 2.130e-01 0.014 0.988778
## GenresVideo 1.470e+01 7.944e+02 0.019 0.985239
## GenresWeather -1.839e+01 2.400e+03 -0.008 0.993886
## GenresWord 4.724e-01 4.710e-01 1.003 0.315914
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 81752 on 101228 degrees of freedom
## Residual deviance: 76180 on 101043 degrees of freedom
## AIC: 76552
##
## Number of Fisher Scoring iterations: 15
## [1] 0.1372433
Looking at the coefficients of the base full-width model, the output is unwieldy. Many coefficients are not significant: only 12 of the 112 language coefficients carry any significance marker (9 at the 5% level), none of the primary genres are significant, and about half of the sub-genres are significant. Many of the non-significant levels have very large estimates with standard errors in the hundreds or thousands, which, together with the glm.fit warning about fitted probabilities of 0 or 1, suggests rare categories that are almost perfectly separated. Still, for a base model, a misclassification error (mce) of 0.137 is not bad.
Next, we added the new predictors we had created to the original model and compared the misclassification errors. The added predictors were the sum of in-app purchases (sum.iap), count.iap, iap.class, days since release, and days since last update.
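The logistic model and a misclassification-error helper can be sketched as follows. The data are simulated, the function name `mce()` and its arguments are our own, and the 0.5 probability cutoff is an assumption, since the report does not state which threshold was used.

```r
set.seed(1)

# Simulated stand-in data; variable names and effect sizes are illustrative only.
n <- 500
toy <- data.frame(
  rating  = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  size_mb = runif(n, 1, 500)
)
toy$free <- rbinom(n, 1, plogis(2 - 0.005 * toy$size_mb))

# Logistic regression: binomial family with the default logit link.
fit <- glm(free ~ ., family = binomial(), data = toy)

# Misclassification error: share of rows whose predicted class
# (fitted probability above the cutoff) differs from the observed class.
mce <- function(model, data, outcome = "free", cutoff = 0.5) {
  prob <- predict(model, newdata = data, type = "response")
  pred <- ifelse(prob > cutoff, 1, 0)
  mean(pred != data[[outcome]])
}

err <- mce(fit, toy)

# Error of always predicting the more common class, for comparison.
baseline <- min(mean(toy$free), 1 - mean(toy$free))
print(c(mce = err, baseline = baseline))
```

Comparing the error with the majority-class baseline shows how much the model adds over simply predicting the more common outcome for every app.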
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = free ~ ., family = binomial(), data = logit.data.clean)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5708 0.0947 0.2637 0.4962 3.1204
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.769e+01 5.658e+02 0.031 0.975055
## `Average User Rating` -1.606e-01 1.562e-02 -10.281 < 2e-16 ***
## `User Rating Count` 1.239e-05 9.048e-07 13.699 < 2e-16 ***
## `Age Rating`17+ 8.972e-01 7.647e-02 11.734 < 2e-16 ***
## `Age Rating`4+ 6.009e-01 3.342e-02 17.980 < 2e-16 ***
## `Age Rating`9+ -6.091e-01 3.593e-02 -16.953 < 2e-16 ***
## LanguagesAM 1.662e+01 1.301e+03 0.013 0.989808
## LanguagesAR 1.012e+00 7.410e-01 1.366 0.171826
## LanguagesAS 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesAY 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesAZ 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesBE 1.581e+01 8.487e+02 0.019 0.985138
## LanguagesBG 2.284e+00 9.075e-01 2.517 0.011849 *
## LanguagesBN 1.477e+01 3.142e+02 0.047 0.962505
## LanguagesBO 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesBR 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesBS 1.508e+01 8.691e+02 0.017 0.986159
## LanguagesCA 6.967e-01 7.441e-01 0.936 0.349100
## LanguagesCS 6.976e-01 7.318e-01 0.953 0.340461
## LanguagesCY 1.527e+01 1.439e+03 0.011 0.991538
## LanguagesDA 4.432e-01 7.302e-01 0.607 0.543865
## LanguagesDE -1.192e-01 7.243e-01 -0.165 0.869280
## LanguagesDZ 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesEL 7.500e-01 7.343e-01 1.021 0.307043
## LanguagesEN 2.818e-01 7.235e-01 0.390 0.696878
## LanguagesEO 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesES -6.180e-02 7.245e-01 -0.085 0.932025
## LanguagesET 1.528e+01 6.070e+02 0.025 0.979918
## LanguagesEU 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesFA 9.259e-01 8.019e-01 1.155 0.248219
## LanguagesFI 4.405e-01 7.316e-01 0.602 0.547067
## LanguagesFO 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesFR -9.412e-02 7.244e-01 -0.130 0.896622
## LanguagesGA 1.386e+01 1.057e+03 0.013 0.989545
## LanguagesGD 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesGL 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesGN 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesGU 1.388e+01 3.622e+02 0.038 0.969424
## LanguagesGV 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesHE 6.088e-01 7.366e-01 0.826 0.408521
## LanguagesHI 9.262e-01 7.724e-01 1.199 0.230470
## LanguagesHR 9.225e-01 7.940e-01 1.162 0.245305
## LanguagesHU 4.824e-01 7.367e-01 0.655 0.512623
## LanguagesHY 1.631e+01 4.869e+02 0.033 0.973279
## LanguagesID 1.001e+00 7.348e-01 1.363 0.172907
## LanguagesIS 1.568e+01 8.083e+02 0.019 0.984522
## LanguagesIT 3.862e-02 7.249e-01 0.053 0.957513
## LanguagesIU 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesJA 9.991e-02 7.247e-01 0.138 0.890351
## LanguagesJV 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesKA 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesKK 1.382e+01 1.257e+03 0.011 0.991226
## LanguagesKL 1.564e+01 1.335e+03 0.012 0.990654
## LanguagesKM 1.605e+01 9.823e+02 0.016 0.986960
## LanguagesKN 1.448e+01 3.323e+02 0.044 0.965252
## LanguagesKO -1.151e-01 7.250e-01 -0.159 0.873877
## LanguagesKR 1.570e+01 1.353e+03 0.012 0.990738
## LanguagesKS 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesKU 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesKY 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesLA 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesLO 1.605e+01 9.823e+02 0.016 0.986960
## LanguagesLT 8.080e-01 9.738e-01 0.830 0.406653
## LanguagesLV 1.466e+01 3.290e+02 0.045 0.964465
## LanguagesMG 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesMK 1.561e+01 7.233e+02 0.022 0.982786
## LanguagesML 1.445e+01 3.592e+02 0.040 0.967918
## LanguagesMN 1.662e+01 1.301e+03 0.013 0.989808
## LanguagesMR 1.452e+01 3.414e+02 0.043 0.966074
## LanguagesMS 9.726e-01 7.407e-01 1.313 0.189153
## LanguagesMT 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesMY 1.568e+01 8.129e+02 0.019 0.984615
## LanguagesNB 3.984e-01 7.312e-01 0.545 0.585882
## LanguagesNE 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesNL 2.372e-01 7.266e-01 0.327 0.744044
## LanguagesNN 9.433e-01 9.251e-01 1.020 0.307865
## LanguagesNO 8.953e-01 7.861e-01 1.139 0.254731
## LanguagesOM 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesOR 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesPA 1.356e+01 3.849e+02 0.035 0.971902
## LanguagesPL 1.898e-01 7.269e-01 0.261 0.794048
## LanguagesPS 1.382e+01 1.257e+03 0.011 0.991226
## LanguagesPT 7.204e-02 7.250e-01 0.099 0.920855
## LanguagesQU 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesRN 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesRO 1.367e+00 7.514e-01 1.819 0.068855 .
## LanguagesRU -1.052e-02 7.247e-01 -0.015 0.988416
## LanguagesRW 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesSA 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesSD 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesSE 5.722e-01 1.014e+00 0.564 0.572510
## LanguagesSI 1.593e+01 8.979e+02 0.018 0.985849
## LanguagesSK 4.184e-01 7.378e-01 0.567 0.570662
## LanguagesSL 1.175e+00 9.102e-01 1.291 0.196783
## LanguagesSO 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesSQ 1.565e+01 6.801e+02 0.023 0.981635
## LanguagesSR 1.162e+00 9.104e-01 1.276 0.201985
## LanguagesSU 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesSV 2.146e-01 7.272e-01 0.295 0.767913
## LanguagesSW 1.613e+01 1.014e+03 0.016 0.987307
## LanguagesTA 1.448e+01 3.323e+02 0.044 0.965252
## LanguagesTE 1.440e+01 3.489e+02 0.041 0.967079
## LanguagesTG 1.382e+01 1.257e+03 0.011 0.991226
## LanguagesTH 6.569e-01 7.318e-01 0.898 0.369392
## LanguagesTI 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesTK 1.382e+01 1.257e+03 0.011 0.991226
## LanguagesTL 1.561e+01 6.843e+02 0.023 0.981798
## LanguagesTO 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesTR 4.084e-01 7.272e-01 0.562 0.574343
## LanguagesTT 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesUG 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesUK 1.043e+00 7.473e-01 1.396 0.162620
## LanguagesUR 1.507e+01 6.271e+02 0.024 0.980828
## LanguagesUZ 1.382e+01 1.257e+03 0.011 0.991226
## LanguagesVI 8.172e-01 7.350e-01 1.112 0.266235
## LanguagesYI 1.448e+01 1.976e+03 0.007 0.994155
## LanguagesZH 1.580e-01 7.245e-01 0.218 0.827350
## LanguagesZU 1.769e+01 2.283e+03 0.008 0.993817
## Size -1.692e-09 4.364e-11 -38.783 < 2e-16 ***
## `Primary Genre`Business -1.987e+01 5.658e+02 -0.035 0.971988
## `Primary Genre`Education -1.698e+01 5.658e+02 -0.030 0.976063
## `Primary Genre`Entertainment -1.390e+01 5.658e+02 -0.025 0.980396
## `Primary Genre`Finance -1.608e+01 5.658e+02 -0.028 0.977326
## `Primary Genre`Food & Drink 3.372e-01 1.832e+03 0.000 0.999853
## `Primary Genre`Games -1.562e+01 5.658e+02 -0.028 0.977979
## `Primary Genre`Health & Fitness -1.963e+01 5.658e+02 -0.035 0.972330
## `Primary Genre`Lifestyle -1.891e+01 5.658e+02 -0.033 0.973342
## `Primary Genre`Medical -1.437e+01 9.196e+02 -0.016 0.987535
## `Primary Genre`Music -4.403e-01 6.574e+02 -0.001 0.999466
## `Primary Genre`Navigation -8.158e-01 1.914e+03 0.000 0.999660
## `Primary Genre`News 4.942e-01 1.496e+03 0.000 0.999736
## `Primary Genre`Productivity -2.083e+01 5.658e+02 -0.037 0.970633
## `Primary Genre`Reference -1.571e+01 5.658e+02 -0.028 0.977844
## `Primary Genre`Shopping 5.897e-01 2.352e+03 0.000 0.999800
## `Primary Genre`Social Networking 9.429e-01 7.467e+02 0.001 0.998992
## `Primary Genre`Sports -1.357e+01 5.658e+02 -0.024 0.980870
## `Primary Genre`Stickers -1.671e+01 5.658e+02 -0.030 0.976440
## `Primary Genre`Travel 9.043e-01 2.047e+03 0.000 0.999648
## `Primary Genre`Utilities -1.694e+01 5.658e+02 -0.030 0.976113
## GenresAdventure -9.832e-02 1.071e-01 -0.918 0.358385
## GenresBoard -4.650e-01 7.714e-02 -6.029 1.65e-09 ***
## GenresBooks -1.924e-01 6.811e-01 -0.282 0.777582
## GenresBusiness -1.517e+00 4.286e-01 -3.538 0.000403 ***
## GenresCard 8.825e-02 1.139e-01 0.775 0.438540
## GenresCasino 1.571e+00 7.276e-01 2.159 0.030850 *
## GenresCasual 6.757e-01 1.280e-01 5.280 1.29e-07 ***
## GenresDrink 7.567e-01 7.442e-01 1.017 0.309268
## GenresEducation -2.767e-01 1.167e-01 -2.371 0.017746 *
## GenresEmoji -7.134e-01 1.593e+00 -0.448 0.654271
## GenresEntertainment -7.729e-02 6.641e-02 -1.164 0.244498
## GenresExpressions -7.134e-01 1.593e+00 -0.448 0.654271
## GenresFamily 2.734e-01 1.396e-01 1.959 0.050089 .
## GenresFinance 1.299e+00 7.840e-01 1.657 0.097569 .
## GenresFitness 1.106e+00 1.192e+00 0.928 0.353464
## GenresFood 7.567e-01 7.442e-01 1.017 0.309268
## GenresGames -1.914e-01 6.252e-02 -3.061 0.002203 **
## GenresGaming -1.962e-01 1.430e+00 -0.137 0.890849
## GenresHealth 1.106e+00 1.192e+00 0.928 0.353464
## GenresKids 1.626e+01 3.956e+03 0.004 0.996720
## GenresLifestyle 1.926e+00 2.605e-01 7.393 1.43e-13 ***
## GenresMagazines 1.400e+01 3.956e+03 0.004 0.997177
## GenresMedical -1.678e+00 1.229e+00 -1.366 0.171968
## GenresMusic 1.810e+00 5.175e-01 3.498 0.000469 ***
## GenresNavigation 1.551e+01 1.416e+03 0.011 0.991260
## GenresNetworking 3.440e-01 3.447e-01 0.998 0.318405
## GenresNews 5.213e-01 1.174e+00 0.444 0.657054
## GenresNewspapers 1.400e+01 3.956e+03 0.004 0.997177
## GenresPhoto 1.543e+01 1.196e+03 0.013 0.989709
## GenresPlaying -4.054e-01 8.811e-02 -4.601 4.21e-06 ***
## GenresProductivity 1.020e-01 4.502e-01 0.226 0.820845
## GenresPuzzle -1.003e-02 7.648e-02 -0.131 0.895698
## GenresRacing -3.996e-01 3.989e-01 -1.002 0.316393
## GenresReference -9.283e-01 3.215e-01 -2.887 0.003884 **
## GenresRole -4.054e-01 8.811e-02 -4.601 4.21e-06 ***
## GenresShopping -1.332e-01 4.567e+03 0.000 0.999977
## GenresSimulation -8.519e-01 7.628e-02 -11.168 < 2e-16 ***
## GenresSocial 3.440e-01 3.447e-01 0.998 0.318405
## GenresSports 4.094e-01 1.743e-01 2.349 0.018841 *
## GenresStickers -1.962e-01 1.430e+00 -0.137 0.890849
## GenresStrategy -1.915e-01 6.252e-02 -3.063 0.002195 **
## GenresTravel -5.789e-01 5.062e-01 -1.143 0.252838
## GenresTrivia 1.612e+00 3.408e-01 4.731 2.24e-06 ***
## GenresUtilities 1.243e-01 2.268e-01 0.548 0.583510
## GenresVideo 1.543e+01 1.196e+03 0.013 0.989709
## GenresWeather -1.984e+01 3.956e+03 -0.005 0.995998
## GenresWord 6.339e-01 5.128e-01 1.236 0.216410
## sum.iap 2.593e-02 8.049e-04 32.217 < 2e-16 ***
## count.iap -8.947e-02 6.639e-03 -13.475 < 2e-16 ***
## iap.class$0.01-$10.00 1.591e+00 3.153e-02 50.454 < 2e-16 ***
## iap.class$10.01-$20.00 1.170e+00 7.322e-02 15.975 < 2e-16 ***
## iap.class$20.01-$30.00 -3.538e-01 1.419e-01 -2.494 0.012619 *
## iap.class$30.01-$40.00 -7.892e-01 1.396e-01 -5.654 1.57e-08 ***
## iap.class$40.01-$80.00 9.810e+00 1.182e+02 0.083 0.933835
## days.since.release -8.451e-04 1.335e-05 -63.289 < 2e-16 ***
## days.since.last.update 7.579e-04 1.684e-05 45.015 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 81752 on 101228 degrees of freedom
## Residual deviance: 57415 on 101034 degrees of freedom
## AIC: 57805
##
## Number of Fisher Scoring iterations: 16
## [1] 0.1150164
Unfortunately, even fewer language coefficients are significant (2 of 112, and only one of them at the 5% level), and fewer sub-genres are significant (17 of 47). The in-app purchase class of $40.01 to $80.00 is not significant either. On the other hand, the misclassification error fell to a new low of 0.115.
Since the clean full-width model had a smaller mce, we filtered the data to keep only the rows with significant variable levels and refit the model. However, filtering left fewer than 600 expanded rows (the tibble below has 569) out of the roughly 100,000 we started with. Because the rows are expanded, they probably represent only about 200 different apps, and the glm algorithm failed to converge with so few observations.
## # A tibble: 569 x 12
## `Average User Rati~ `User Rating Coun~ `Age Rating` Languages Size Genres
## <dbl> <dbl> <chr> <chr> <dbl> <chr>
## 1 4.5 143719 4+ BG 1.10e8 Games
## 2 4.5 143719 4+ BG 1.10e8 Strate~
## 3 4.5 143719 4+ BG 1.10e8 Board
## 4 4.5 143719 4+ RO 1.10e8 Games
## 5 4.5 143719 4+ RO 1.10e8 Strate~
## 6 4.5 143719 4+ RO 1.10e8 Board
## 7 3 3909 17+ RO 1.31e8 Games
## 8 3 3909 17+ RO 1.31e8 Simula~
## 9 3 3909 17+ RO 1.31e8 Strate~
## 10 3.5 244 9+ RO 5.12e7 Games
## # ... with 559 more rows, and 6 more variables: sum.iap <dbl>, count.iap <int>,
## # iap.class <chr>, free <dbl>, days.since.release <dbl>,
## # days.since.last.update <dbl>
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
So instead of combining the filters with “and,” we combined them with “or,” meaning that as long as a row contained any significant variable level, we kept it for the model. This left significantly more data to work with, about 93,000 rows. For the sig.or model we also removed Primary Genre completely, because none of its levels had been significant (all p-values were above 0.95).
## # A tibble: 93,152 x 12
## `Average User Rati~ `User Rating Coun~ `Age Rating` Size Genres Languages
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 4 3553 4+ 1.59e7 Games DA
## 2 4 3553 4+ 1.59e7 Strate~ DA
## 3 4 3553 4+ 1.59e7 Games NL
## 4 4 3553 4+ 1.59e7 Strate~ NL
## 5 4 3553 4+ 1.59e7 Games EN
## 6 4 3553 4+ 1.59e7 Strate~ EN
## 7 4 3553 4+ 1.59e7 Games FI
## 8 4 3553 4+ 1.59e7 Strate~ FI
## 9 4 3553 4+ 1.59e7 Games FR
## 10 4 3553 4+ 1.59e7 Strate~ FR
## # ... with 93,142 more rows, and 6 more variables: sum.iap <dbl>,
## # count.iap <int>, iap.class <chr>, free <dbl>, days.since.release <dbl>,
## # days.since.last.update <dbl>
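The difference between the "and" and "or" filters can be sketched on a few made-up rows. The vectors `sig_languages` and `sig_genres` are hypothetical examples of significant levels, not the exact sets used in the report.

```r
# Made-up expanded rows: one row per app, language, and genre combination.
long <- data.frame(
  Name      = c("A", "A", "B", "B", "C"),
  Languages = c("BG", "BG", "EN", "EN", "RO"),
  Genres    = c("Board", "Puzzle", "Board", "Puzzle", "Trivia"),
  stringsAsFactors = FALSE
)

# Hypothetical sets of significant levels.
sig_languages <- c("BG", "RO")
sig_genres    <- c("Board", "Trivia")

lang_ok  <- long$Languages %in% sig_languages
genre_ok <- long$Genres %in% sig_genres

# "and": every filtered variable must have a significant level.
sig_and <- long[lang_ok & genre_ok, ]

# "or": at least one filtered variable must have a significant level.
sig_or <- long[lang_ok | genre_ok, ]

print(c(and = nrow(sig_and), or = nrow(sig_or)))
```

Because very few language levels are significant, the "and" condition discards almost every row, while the "or" condition keeps any row with a significant level in at least one variable.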
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.sig.or)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5474 0.1017 0.2665 0.4744 2.4329
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.408e+00 7.360e-01 1.913 0.05579 .
## `Average User Rating` -1.085e-01 1.672e-02 -6.492 8.49e-11 ***
## `User Rating Count` 1.260e-05 8.668e-07 14.532 < 2e-16 ***
## `Age Rating`17+ 9.032e-01 8.511e-02 10.612 < 2e-16 ***
## `Age Rating`4+ 5.028e-01 3.495e-02 14.388 < 2e-16 ***
## `Age Rating`9+ -6.209e-01 3.749e-02 -16.563 < 2e-16 ***
## Size -1.579e-09 4.369e-11 -36.141 < 2e-16 ***
## GenresAdventure 1.959e-01 1.649e-01 1.188 0.23483
## GenresBoard -1.674e-01 8.800e-02 -1.902 0.05715 .
## GenresBooks 1.537e+01 8.340e+02 0.018 0.98530
## GenresBusiness -2.040e+00 3.435e-01 -5.937 2.90e-09 ***
## GenresCard 7.597e-01 1.924e-01 3.949 7.85e-05 ***
## GenresCasino 1.851e+00 7.286e-01 2.541 0.01107 *
## GenresCasual 1.017e+00 1.339e-01 7.595 3.07e-14 ***
## GenresDrink 1.441e+01 4.718e+02 0.031 0.97564
## GenresEducation -2.575e-01 1.200e-01 -2.146 0.03190 *
## GenresEntertainment 3.475e-01 8.519e-02 4.079 4.53e-05 ***
## GenresFamily 6.168e-01 1.427e-01 4.322 1.55e-05 ***
## GenresFinance 1.514e+00 7.818e-01 1.937 0.05275 .
## GenresFitness -1.500e-01 1.056e+00 -0.142 0.88696
## GenresFood 1.441e+01 4.718e+02 0.031 0.97564
## GenresGames 8.741e-02 7.560e-02 1.156 0.24760
## GenresHealth -1.500e-01 1.056e+00 -0.142 0.88696
## GenresLifestyle 1.836e+00 2.531e-01 7.255 4.01e-13 ***
## GenresMagazines 1.436e+01 3.956e+03 0.004 0.99710
## GenresMedical -1.833e+01 2.795e+03 -0.007 0.99477
## GenresMusic 2.304e+00 5.154e-01 4.470 7.82e-06 ***
## GenresNavigation 1.471e+01 2.742e+03 0.005 0.99572
## GenresNetworking 3.457e-02 3.416e-01 0.101 0.91938
## GenresNews 1.544e+01 1.876e+03 0.008 0.99344
## GenresNewspapers 1.436e+01 3.956e+03 0.004 0.99710
## GenresPhoto 1.487e+01 1.574e+03 0.009 0.99246
## GenresPlaying -2.117e-01 9.621e-02 -2.200 0.02779 *
## GenresProductivity -4.512e+00 4.416e-01 -10.218 < 2e-16 ***
## GenresPuzzle 4.431e-01 1.210e-01 3.663 0.00025 ***
## GenresRacing -2.074e+00 3.380e-01 -6.137 8.43e-10 ***
## GenresReference -6.434e-01 3.001e-01 -2.144 0.03205 *
## GenresRole -2.117e-01 9.621e-02 -2.200 0.02779 *
## GenresSimulation -5.511e-01 8.653e-02 -6.369 1.90e-10 ***
## GenresSocial 3.457e-02 3.416e-01 0.101 0.91938
## GenresSports 9.937e-01 1.769e-01 5.618 1.93e-08 ***
## GenresStrategy 8.734e-02 7.560e-02 1.155 0.24795
## GenresTravel 1.173e+00 1.040e+00 1.128 0.25942
## GenresTrivia 1.864e+00 3.361e-01 5.545 2.94e-08 ***
## GenresUtilities 1.076e+00 5.393e-01 1.995 0.04607 *
## GenresVideo 1.487e+01 1.574e+03 0.009 0.99246
## GenresWord 1.330e+00 1.029e+00 1.293 0.19611
## LanguagesAM 1.668e+01 1.394e+03 0.012 0.99046
## LanguagesAR 1.347e+00 7.478e-01 1.801 0.07165 .
## LanguagesAS 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesAY 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesAZ 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesBE 1.581e+01 8.774e+02 0.018 0.98562
## LanguagesBG 2.461e+00 9.100e-01 2.704 0.00684 **
## LanguagesBN 1.497e+01 3.118e+02 0.048 0.96172
## LanguagesBO 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesBR 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesBS 1.528e+01 8.868e+02 0.017 0.98626
## LanguagesCA 9.991e-01 7.504e-01 1.331 0.18308
## LanguagesCS 8.977e-01 7.369e-01 1.218 0.22311
## LanguagesCY 1.559e+01 1.434e+03 0.011 0.99133
## LanguagesDA 5.852e-01 7.351e-01 0.796 0.42599
## LanguagesDE 1.129e-01 7.285e-01 0.155 0.87684
## LanguagesDZ 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesEL 9.417e-01 7.395e-01 1.273 0.20285
## LanguagesEN 4.961e-01 7.276e-01 0.682 0.49533
## LanguagesEO 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesES 1.934e-01 7.287e-01 0.265 0.79074
## LanguagesET 1.537e+01 6.049e+02 0.025 0.97973
## LanguagesEU 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesFA 1.219e+00 8.201e-01 1.486 0.13729
## LanguagesFI 6.579e-01 7.369e-01 0.893 0.37198
## LanguagesFO 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesFR 1.370e-01 7.286e-01 0.188 0.85084
## LanguagesGA 1.492e+01 1.002e+03 0.015 0.98812
## LanguagesGD 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesGL 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesGN 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesGU 1.426e+01 3.534e+02 0.040 0.96780
## LanguagesGV 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesHE 8.331e-01 7.417e-01 1.123 0.26133
## LanguagesHI 1.282e+00 7.884e-01 1.626 0.10398
## LanguagesHR 1.427e+00 8.300e-01 1.719 0.08563 .
## LanguagesHU 6.244e-01 7.414e-01 0.842 0.39964
## LanguagesHY 1.636e+01 5.157e+02 0.032 0.97469
## LanguagesID 1.273e+00 7.408e-01 1.719 0.08561 .
## LanguagesIS 1.567e+01 8.334e+02 0.019 0.98500
## LanguagesIT 2.666e-01 7.291e-01 0.366 0.71468
## LanguagesIU 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesJA 3.222e-01 7.289e-01 0.442 0.65851
## LanguagesJV 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesKA 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesKK 1.409e+01 1.261e+03 0.011 0.99109
## LanguagesKL 1.595e+01 1.333e+03 0.012 0.99045
## LanguagesKM 1.607e+01 1.028e+03 0.016 0.98753
## LanguagesKN 1.465e+01 3.306e+02 0.044 0.96466
## LanguagesKO 1.209e-01 7.292e-01 0.166 0.86837
## LanguagesKR 1.581e+01 1.320e+03 0.012 0.99044
## LanguagesKS 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesKU 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesKY 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesLA 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesLO 1.607e+01 1.028e+03 0.016 0.98753
## LanguagesLT 7.773e-01 9.808e-01 0.792 0.42809
## LanguagesLV 1.479e+01 3.298e+02 0.045 0.96423
## LanguagesMG 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesMK 1.564e+01 7.571e+02 0.021 0.98352
## LanguagesML 1.459e+01 3.566e+02 0.041 0.96736
## LanguagesMN 1.668e+01 1.394e+03 0.012 0.99046
## LanguagesMR 1.469e+01 3.386e+02 0.043 0.96540
## LanguagesMS 1.231e+00 7.466e-01 1.649 0.09923 .
## LanguagesMT 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesMY 1.566e+01 8.395e+02 0.019 0.98512
## LanguagesNB 6.132e-01 7.365e-01 0.833 0.40503
## LanguagesNE 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesNL 4.519e-01 7.311e-01 0.618 0.53650
## LanguagesNN 1.607e+00 1.064e+00 1.510 0.13117
## LanguagesNO 8.033e-01 7.963e-01 1.009 0.31302
## LanguagesOM 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesOR 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesPA 1.383e+01 3.865e+02 0.036 0.97146
## LanguagesPL 4.268e-01 7.314e-01 0.584 0.55950
## LanguagesPS 1.409e+01 1.261e+03 0.011 0.99109
## LanguagesPT 3.056e-01 7.293e-01 0.419 0.67522
## LanguagesQU 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesRN 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesRO 1.633e+00 7.550e-01 2.164 0.03050 *
## LanguagesRU 2.458e-01 7.290e-01 0.337 0.73599
## LanguagesRW 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesSA 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesSD 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesSE 5.839e-01 1.040e+00 0.561 0.57452
## LanguagesSI 1.595e+01 9.321e+02 0.017 0.98635
## LanguagesSK 5.797e-01 7.425e-01 0.781 0.43496
## LanguagesSL 1.842e+00 1.045e+00 1.762 0.07802 .
## LanguagesSO 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesSQ 1.574e+01 7.089e+02 0.022 0.98229
## LanguagesSR 1.835e+00 1.045e+00 1.755 0.07918 .
## LanguagesSU 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesSV 4.119e-01 7.318e-01 0.563 0.57351
## LanguagesSW 1.613e+01 1.052e+03 0.015 0.98777
## LanguagesTA 1.465e+01 3.306e+02 0.044 0.96466
## LanguagesTE 1.455e+01 3.477e+02 0.042 0.96661
## LanguagesTG 1.409e+01 1.261e+03 0.011 0.99109
## LanguagesTH 9.369e-01 7.372e-01 1.271 0.20378
## LanguagesTI 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesTK 1.409e+01 1.261e+03 0.011 0.99109
## LanguagesTL 1.571e+01 7.104e+02 0.022 0.98236
## LanguagesTO 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesTR 6.426e-01 7.317e-01 0.878 0.37985
## LanguagesTT 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesUG 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesUK 1.386e+00 7.553e-01 1.835 0.06651 .
## LanguagesUR 1.534e+01 6.052e+02 0.025 0.97978
## LanguagesUZ 1.409e+01 1.261e+03 0.011 0.99109
## LanguagesVI 1.104e+00 7.409e-01 1.490 0.13614
## LanguagesYI 1.476e+01 1.975e+03 0.007 0.99404
## LanguagesZH 3.738e-01 7.288e-01 0.513 0.60801
## LanguagesZU 1.799e+01 2.797e+03 0.006 0.99487
## sum.iap 2.543e-02 7.938e-04 32.031 < 2e-16 ***
## count.iap -8.663e-02 6.574e-03 -13.178 < 2e-16 ***
## iap.class$0.01-$10.00 1.509e+00 3.317e-02 45.490 < 2e-16 ***
## iap.class$10.01-$20.00 1.051e+00 7.180e-02 14.639 < 2e-16 ***
## iap.class$20.01-$30.00 -3.587e-01 1.404e-01 -2.556 0.01060 *
## iap.class$30.01-$40.00 -7.776e-01 1.392e-01 -5.588 2.30e-08 ***
## iap.class$40.01-$80.00 1.007e+01 1.383e+02 0.073 0.94195
## days.since.release -8.209e-04 1.428e-05 -57.473 < 2e-16 ***
## days.since.last.update 7.172e-04 1.812e-05 39.585 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 71103 on 93151 degrees of freedom
## Residual deviance: 50834 on 92984 degrees of freedom
## AIC: 51170
##
## Number of Fisher Scoring iterations: 16
## [1] 0.1077057
Looking at the sig.or model, the significant coefficients in Languages and sub genres are similar to those of the base full width model, but the misclassification error (mce) is even lower at 0.1077. Our significant variables are size, count and sum of in-app purchases (count.iap and sum.iap), days since release, days since last update, user rating count, and age rating.
For logit.err, we filtered so that every column of a row contained a significant variable, but because so few of the languages were significant, we couldn’t get enough data to fit a model. So this time we removed the Languages column and then filtered by the remaining significant variables.
## # A tibble: 54,136 x 11
## `Average User Ratin~ `User Rating Coun~ `Age Rating` Size Genres sum.iap
## <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 3 47 4+ 4.87e7 Games 1.99
## 2 3 47 4+ 4.87e7 Strate~ 1.99
## 3 3 112 4+ 1.23e8 Games 0.99
## 4 3 112 4+ 1.23e8 Strate~ 0.99
## 5 3 112 4+ 1.23e8 Board 0.99
## 6 3 112 4+ 1.23e8 Games 0.99
## 7 3 112 4+ 1.23e8 Strate~ 0.99
## 8 3 112 4+ 1.23e8 Board 0.99
## 9 3 112 4+ 1.23e8 Games 0.99
## 10 3 112 4+ 1.23e8 Strate~ 0.99
## # ... with 54,126 more rows, and 5 more variables: count.iap <int>,
## # iap.class <chr>, free <dbl>, days.since.release <dbl>,
## # days.since.last.update <dbl>
##
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.sig.and.nl)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5888 0.1200 0.2508 0.3872 1.6717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.685e+00 1.577e-01 17.025 < 2e-16 ***
## `Average User Rating` 1.256e-02 2.955e-02 0.425 0.670897
## `User Rating Count` 5.673e-06 1.017e-06 5.579 2.41e-08 ***
## `Age Rating`17+ 5.631e-01 1.591e-01 3.539 0.000402 ***
## `Age Rating`4+ 7.357e-01 5.277e-02 13.942 < 2e-16 ***
## `Age Rating`9+ -6.317e-01 5.161e-02 -12.240 < 2e-16 ***
## Size -1.055e-09 5.241e-11 -20.125 < 2e-16 ***
## GenresBusiness -2.227e+00 3.892e-01 -5.722 1.06e-08 ***
## GenresCasino 1.258e+00 1.047e+00 1.201 0.229566
## GenresCasual 1.038e+00 1.823e-01 5.694 1.24e-08 ***
## GenresEducation 1.054e+00 2.509e-01 4.201 2.66e-05 ***
## GenresFamily 3.563e+00 5.854e-01 6.086 1.15e-09 ***
## GenresFinance 1.092e+00 1.034e+00 1.056 0.290830
## GenresGames 3.961e-01 7.901e-02 5.013 5.35e-07 ***
## GenresLifestyle 1.678e+00 6.033e-01 2.782 0.005407 **
## GenresMusic 3.366e+00 1.008e+00 3.338 0.000843 ***
## GenresPlaying 3.722e-01 1.127e-01 3.301 0.000963 ***
## GenresReference -1.780e-01 5.080e-01 -0.350 0.726068
## GenresRole 3.722e-01 1.127e-01 3.301 0.000963 ***
## GenresSimulation -1.889e-01 9.676e-02 -1.952 0.050889 .
## GenresSports 1.477e+00 3.025e-01 4.882 1.05e-06 ***
## GenresStrategy 3.961e-01 7.901e-02 5.013 5.35e-07 ***
## GenresTrivia 1.366e+01 9.998e+01 0.137 0.891308
## sum.iap 2.498e-02 8.917e-04 28.013 < 2e-16 ***
## count.iap -8.592e-02 7.615e-03 -11.282 < 2e-16 ***
## iap.class$10.01-$20.00 -6.603e-01 7.507e-02 -8.796 < 2e-16 ***
## iap.class$20.01-$30.00 -1.885e+00 1.581e-01 -11.925 < 2e-16 ***
## iap.class$30.01-$40.00 -2.178e+00 1.574e-01 -13.836 < 2e-16 ***
## days.since.release -8.818e-04 2.336e-05 -37.755 < 2e-16 ***
## days.since.last.update 3.794e-04 2.963e-05 12.802 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 27563 on 54135 degrees of freedom
## Residual deviance: 22195 on 54106 degrees of freedom
## AIC: 22255
##
## Number of Fisher Scoring iterations: 14
## [1] 0.0691961
After looking through the Languages and sub genre columns, we found that many languages and sub genres have very few occurrences. We figured that a level occurring in only 10-30 apps wouldn’t provide enough data to predict anything successfully. So we counted the occurrences of every language and sub genre, and any that occurred fewer than 30 times was renamed “Other.”
## # A tibble: 6 x 3
## Languages n lang.new
## <chr> <int> <chr>
## 1 SD 1 Other
## 2 SO 1 Other
## 3 SU 1 Other
## 4 TI 1 Other
## 5 TO 1 Other
## 6 TT 1 Other
## # A tibble: 6 x 3
## Genres n genre.new
## <chr> <int> <chr>
## 1 Medical 3 Other
## 2 Stickers 3 Other
## 3 Emoji 2 Other
## 4 Expressions 2 Other
## 5 Kids 1 Other
## 6 Magazines 1 Other
We then joined the new variables to the previous data using left_join and reran practically the same model. The only difference is that, instead of filtering for the significant languages and sub genres, we renamed any language or sub genre that occurred fewer than 30 times to “Other.”
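The count-and-rename step can be sketched as follows. This is a Python illustration rather than the R code used for the analysis; the helper name `lump_rare` and the example column are hypothetical, and only the threshold of 30 comes from the text.

```python
from collections import Counter

MIN_COUNT = 30  # threshold taken from the text

def lump_rare(values, min_count=MIN_COUNT, other="Other"):
    """Replace levels seen fewer than min_count times with a single label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

# Invented example column: EN is common, SD and TT are rare.
languages = ["EN"] * 40 + ["SD"] + ["TT"] * 2
lumped = lump_rare(languages)

print(Counter(lumped))
```

Lumping keeps every row in the data, unlike filtering, while still collapsing levels that are too rare to estimate a coefficient reliably.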
## # A tibble: 3 x 12
## `Average User Rati~ `User Rating Coun~ `Age Rating` Size genre.new lang.new
## <dbl> <dbl> <chr> <dbl> <chr> <chr>
## 1 4.5 822 12+ 7.78e8 Other Other
## 2 4.5 1026 4+ 5.65e7 Other Other
## 3 4.5 1026 4+ 5.65e7 Other Other
## # ... with 6 more variables: sum.iap <dbl>, count.iap <int>, iap.class <chr>,
## # free <dbl>, days.since.release <dbl>, days.since.last.update <dbl>
##
## Call:
## glm(formula = free ~ ., family = binomial, data = logit.data.count)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.5746 0.1168 0.2447 0.3805 1.6717
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.244e+00 2.646e-01 16.043 < 2e-16 ***
## `Average User Rating` -1.897e-02 2.648e-02 -0.716 0.473711
## `User Rating Count` 6.112e-06 8.679e-07 7.043 1.89e-12 ***
## `Age Rating`17+ 3.386e-01 1.352e-01 2.505 0.012238 *
## `Age Rating`4+ 5.970e-01 4.753e-02 12.560 < 2e-16 ***
## `Age Rating`9+ -7.058e-01 4.647e-02 -15.187 < 2e-16 ***
## Size -1.045e-09 4.618e-11 -22.619 < 2e-16 ***
## genre.newAdventure 1.693e-01 1.650e-01 1.026 0.304943
## genre.newBoard -4.014e-01 1.036e-01 -3.876 0.000106 ***
## genre.newCard 5.421e-01 1.899e-01 2.855 0.004308 **
## genre.newCasual 6.387e-01 1.819e-01 3.511 0.000447 ***
## genre.newEducation 6.439e-01 2.531e-01 2.544 0.010972 *
## genre.newEntertainment 3.170e-01 8.492e-02 3.733 0.000189 ***
## genre.newFamily 3.121e+00 5.854e-01 5.332 9.70e-08 ***
## genre.newGames -4.023e-03 7.731e-02 -0.052 0.958497
## genre.newLifestyle 1.301e+00 6.041e-01 2.153 0.031321 *
## genre.newMusic 2.916e+00 1.009e+00 2.889 0.003863 **
## genre.newNetworking 1.050e-01 3.407e-01 0.308 0.757919
## genre.newOther -1.108e+00 1.777e-01 -6.235 4.50e-10 ***
## genre.newPlaying -6.077e-02 1.095e-01 -0.555 0.578967
## genre.newPuzzle 4.082e-01 1.218e-01 3.350 0.000807 ***
## genre.newRacing -2.067e+00 3.424e-01 -6.036 1.58e-09 ***
## genre.newReference -5.711e-01 5.078e-01 -1.125 0.260711
## genre.newRole -6.077e-02 1.095e-01 -0.555 0.578967
## genre.newSimulation -5.680e-01 9.505e-02 -5.976 2.28e-09 ***
## genre.newSocial 1.050e-01 3.407e-01 0.308 0.757919
## genre.newSports 1.106e+00 3.031e-01 3.648 0.000264 ***
## genre.newStrategy -4.023e-03 7.731e-02 -0.052 0.958497
## genre.newTravel 1.172e+00 1.046e+00 1.120 0.262836
## genre.newTrivia 1.430e+01 1.635e+02 0.087 0.930308
## genre.newUtilities 1.124e+00 5.421e-01 2.073 0.038182 *
## genre.newWord 1.309e+00 1.033e+00 1.267 0.205024
## lang.newBG 1.184e+01 1.905e+02 0.062 0.950447
## lang.newBN 1.224e+01 2.041e+02 0.060 0.952174
## lang.newCA -1.867e-01 3.549e-01 -0.526 0.598796
## lang.newCS -2.865e-01 2.940e-01 -0.975 0.329776
## lang.newDA -1.074e+00 2.575e-01 -4.171 3.04e-05 ***
## lang.newDE -1.143e+00 2.282e-01 -5.008 5.51e-07 ***
## lang.newEL -6.216e-01 2.805e-01 -2.216 0.026674 *
## lang.newEN -9.176e-01 2.227e-01 -4.120 3.78e-05 ***
## lang.newES -1.060e+00 2.293e-01 -4.621 3.81e-06 ***
## lang.newFA -9.519e-01 5.599e-01 -1.700 0.089102 .
## lang.newFI -9.902e-01 2.672e-01 -3.706 0.000210 ***
## lang.newFR -1.095e+00 2.289e-01 -4.784 1.72e-06 ***
## lang.newHE -8.460e-01 2.817e-01 -3.003 0.002675 **
## lang.newHI -1.777e-01 4.747e-01 -0.374 0.708073
## lang.newHR 2.396e-01 6.272e-01 0.382 0.702457
## lang.newHU -1.053e+00 2.776e-01 -3.792 0.000149 ***
## lang.newID -8.781e-02 2.975e-01 -0.295 0.767911
## lang.newIT -1.010e+00 2.320e-01 -4.355 1.33e-05 ***
## lang.newJA -1.044e+00 2.299e-01 -4.541 5.60e-06 ***
## lang.newKO -1.208e+00 2.311e-01 -5.229 1.70e-07 ***
## lang.newMS -1.465e-01 3.211e-01 -0.456 0.648313
## lang.newNB -8.904e-01 2.686e-01 -3.314 0.000919 ***
## lang.newNL -7.945e-01 2.461e-01 -3.228 0.001247 **
## lang.newNO -1.544e+00 4.679e-01 -3.299 0.000969 ***
## lang.newOther 9.856e-01 4.673e-01 2.109 0.034943 *
## lang.newPL -9.147e-01 2.460e-01 -3.718 0.000200 ***
## lang.newPT -9.440e-01 2.330e-01 -4.051 5.10e-05 ***
## lang.newRO 4.875e-01 4.420e-01 1.103 0.270028
## lang.newRU -9.776e-01 2.307e-01 -4.238 2.26e-05 ***
## lang.newSK -1.089e+00 2.780e-01 -3.918 8.93e-05 ***
## lang.newSL 1.160e+01 2.065e+02 0.056 0.955221
## lang.newSV -8.749e-01 2.517e-01 -3.476 0.000509 ***
## lang.newTH -3.596e-01 2.806e-01 -1.282 0.199967
## lang.newTR -5.553e-01 2.506e-01 -2.216 0.026718 *
## lang.newUK 6.600e-01 4.434e-01 1.488 0.136637
## lang.newVI -1.525e-01 3.054e-01 -0.499 0.617549
## lang.newZH -9.876e-01 2.296e-01 -4.301 1.70e-05 ***
## sum.iap 2.305e-02 7.906e-04 29.153 < 2e-16 ***
## count.iap -8.643e-02 6.660e-03 -12.977 < 2e-16 ***
## iap.class$10.01-$20.00 -4.380e-01 6.848e-02 -6.397 1.59e-10 ***
## iap.class$20.01-$30.00 -1.612e+00 1.405e-01 -11.475 < 2e-16 ***
## iap.class$30.01-$40.00 -1.982e+00 1.422e-01 -13.935 < 2e-16 ***
## days.since.release -9.167e-04 2.060e-05 -44.491 < 2e-16 ***
## days.since.last.update 4.815e-04 2.601e-05 18.511 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 35855 on 71459 degrees of freedom
## Residual deviance: 28725 on 71384 degrees of freedom
## AIC: 28877
##
## Number of Fisher Scoring iterations: 15
## [1] 0.06766023
Looking at the coefficients for the count model, we have a higher ratio of significant to insignificant variables in languages and sub genres than in the previous versions, and we yet again decreased our mce, this time to 0.0677. One small issue was that the average user rating variable was insignificant again, but when we tried removing it the mce jumped, so we kept it in. So this should be our best model, right?
To confirm our results, we created a function to find the ROC curve and mce for each model and then compared them.
## # A tibble: 367,070 x 4
## observed predicted mce class
## <dbl> <dbl> <dbl> <chr>
## 1 1 0.533 0.0677 count
## 2 1 0.611 0.0677 count
## 3 1 0.533 0.0677 count
## 4 1 0.633 0.0677 count
## 5 1 0.685 0.0677 count
## 6 1 0.685 0.0677 count
## 7 1 0.593 0.0677 count
## 8 1 0.645 0.0677 count
## 9 1 0.645 0.0677 count
## 10 1 0.550 0.0677 count
## # ... with 367,060 more rows
## # A tibble: 4 x 2
## class mce
## <chr> <dbl>
## 1 base.fw 0.137
## 2 clean.fw 0.115
## 3 count 0.0677
## 4 sig.or 0.108
## # A tibble: 367,070 x 4
## observed predicted mce class
## <dbl> <dbl> <dbl> <chr>
## 1 1 0.533 0.0677 count 0.0677
## 2 1 0.611 0.0677 count 0.0677
## 3 1 0.533 0.0677 count 0.0677
## 4 1 0.633 0.0677 count 0.0677
## 5 1 0.685 0.0677 count 0.0677
## 6 1 0.685 0.0677 count 0.0677
## 7 1 0.593 0.0677 count 0.0677
## 8 1 0.645 0.0677 count 0.0677
## 9 1 0.645 0.0677 count 0.0677
## 10 1 0.550 0.0677 count 0.0677
## # ... with 367,060 more rows
Looking first at the mce plot, we can see the four models and their respective mce values, with count being the lowest and base full width being the greatest. However, when we look at the ROC curves, we see that the sig.or and clean full width models have curves closer to the top-left corner and thus a greater AUC.
Misclassification error is calculated using just one threshold, whereas the ROC curve reflects both type I and type II errors and shows the classification results at all thresholds. So even though count has the smallest mce, sig.or and clean.fw are the better models.
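A toy example may make the distinction concrete. The sketch below is in Python rather than R, uses invented predictions for two hypothetical models, and assumes a 0.5 threshold for the misclassification error. The AUC is computed from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one.

```python
def mce(observed, predicted, threshold=0.5):
    """Share of cases whose thresholded prediction differs from the label."""
    wrong = sum((p >= threshold) != bool(y) for y, p in zip(observed, predicted))
    return wrong / len(observed)

def auc(observed, predicted):
    """Probability that a random positive outscores a random negative
    (ties count one half); this equals the area under the ROC curve."""
    pos = [p for y, p in zip(observed, predicted) if y == 1]
    neg = [p for y, p in zip(observed, predicted) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

observed = [1, 1, 1, 1, 0, 0]
# Invented scores: model_a ranks every positive above every negative,
# but both negatives still sit above the 0.5 threshold.
model_a = [0.9, 0.8, 0.7, 0.6, 0.55, 0.52]
# Invented scores: model_b makes fewer mistakes at 0.5 but ranks one pair wrongly.
model_b = [0.9, 0.8, 0.7, 0.45, 0.48, 0.1]

for name, pred in [("model_a", model_a), ("model_b", model_b)]:
    print(name, round(mce(observed, pred), 3), round(auc(observed, pred), 3))
```

Here model_b has the lower error at the single threshold while model_a has the higher AUC, the same pattern seen between count and sig.or.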
Since clean.fw and sig.or were practically the same, we used sig.or as the best model to calculate some classification evaluation metrics.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor Genres has new levels Magazines, Newspapers
## [1] 0.1077057
## logit.count.class
## 0 1
## 0 3463 8416
## 1 1617 79656
## [1] 0.9801041
## [1] 0.2915229
## # A tibble: 93,152 x 2
## observed predicted
## <dbl> <dbl>
## 1 0 0.340
## 2 0 0.340
## 3 0 0.311
## 4 0 0.311
## 5 0 0.320
## 6 0 0.320
## 7 0 0.356
## 8 0 0.356
## 9 0 0.247
## 10 0 0.247
## # ... with 93,142 more rows
We weren’t able to find the 10-fold cross-validation mce because cv.glm kept giving an error about the dataset containing too many variables with too few data points; the error shown above reports Genres levels (Magazines and Newspapers) that the fitted model had not seen. These were the kinds of rare levels we had attempted to remove in the count model earlier. So we just used the custom function to find the mce, which shows that about 10.8% of predictions are mistakes if we apply our model.
Then we found the confusion matrix using the table function to compare the observed and predicted values of whether the game was free or not. From the confusion matrix we calculated a True Positive Rate of 0.98 and a True Negative Rate of 0.2915, which gives a False Positive Rate of 0.7085.
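As a check on the arithmetic, the rates can be recomputed directly from the four counts in the confusion matrix above. The sketch is in Python rather than R and assumes that rows are observed values, columns are predicted values, and 1 means "free".

```python
# Counts read from the confusion matrix above
# (assumed layout: rows = observed, columns = predicted, 1 = free).
tn, fp = 3463, 8416    # observed 0
fn, tp = 1617, 79656   # observed 1

total = tn + fp + fn + tp

tpr = tp / (tp + fn)        # true positive rate (sensitivity)
tnr = tn / (tn + fp)        # true negative rate (specificity)
fpr = fp / (tn + fp)        # false positive rate, equal to 1 - tnr
mce = (fp + fn) / total     # misclassification error

print(total, round(tpr, 4), round(tnr, 4), round(fpr, 4), round(mce, 4))
```

The high false positive rate shows that most of the errors come from paid games being predicted as free, which is expected when the large majority of games in the data are free.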
Model evaluation and validations
##
## Call:
## lm(formula = `Average User Rating` ~ Size, data = clean_games)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0951 -0.5367 0.3238 0.4550 0.9650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.035e+00 1.006e-02 401.017 < 2e-16 ***
## Size 1.797e-10 3.383e-11 5.312 1.11e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7491 on 7486 degrees of freedom
## Multiple R-squared: 0.003756, Adjusted R-squared: 0.003623
## F-statistic: 28.22 on 1 and 7486 DF, p-value: 1.113e-07
## [1] 0.7491455
## [1] 0.1844232
## [1] 0.003755735
The RSE of our model Average User Rating ≈ f(Size) = β0 + β1 × Size is 0.7491, and the percentage of prediction error is 18.4%. Only about 0.376% of the variability in Average User Rating is explained by a linear regression on Size. The F-statistic (28.22) is much greater than 1, so we can assume there is a relationship between Size and Average User Rating. When we separate Size from the other predictors, we can calculate its RSE, adjusted R², and other statistics. The numbers were really close to what we had before, except that the F-statistic is much higher, which reinforces the idea that Size and Average User Rating are related, although the very low R² means that Size explains little of the variation in ratings.
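The reported statistics are tied together by standard identities, which a short calculation can confirm. The sketch below is in Python rather than R and uses only numbers copied from the summary output above. Reading the 18.4% figure as RSE divided by the mean rating is an assumption about how it was computed.

```python
def f_statistic(r_squared, n_obs, n_predictors=1):
    """Overall F-statistic of a linear model, computed from its R-squared."""
    df_resid = n_obs - n_predictors - 1
    return (r_squared / n_predictors) / ((1 - r_squared) / df_resid)

# Values copied from the summary output above.
r2 = 0.003755735
rse = 0.7491455
n_obs = 7486 + 2  # residual degrees of freedom plus two estimated coefficients

f_stat = f_statistic(r2, n_obs)

# Assumption: the error percentage (0.1844232) is RSE / mean(Average User Rating),
# which would imply the mean rating computed here.
implied_mean_rating = rse / 0.1844232

print(round(f_stat, 2), round(implied_mean_rating, 2))
```

Because the F-statistic scales with the residual degrees of freedom, a sample of several thousand games can produce a large F even when R² is close to zero.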
##
## Call:
## lm(formula = `Average User Rating` ~ Size, data = clean_games)
##
## Coefficients:
## (Intercept) Size
## 4.035e+00 1.797e-10
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The residual plot suggests that there is some non-linearity in the data.
## avg.user.rating user.rating.count price size
## Book 4.300000 57.6000 0.0000000 52586701
## Business 3.000000 16.5000 4.9950000 136453120
## Education 4.152174 124.8913 2.1041304 113516531
## Entertainment 3.831522 171.6087 0.2161957 76567793
## Finance 4.062500 7725.3750 17.4987500 84826496
## Food & Drink 5.000000 7.0000 0.0000000 106633216
## Warning: `funs()` was deprecated in dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## avg.user.rating_mean user.rating.count_mean price_mean size_mean
## 1 3.667553e-16 -1.100465e-17 -4.228324e-17 -4.297571e-17
## avg.user.rating_sd user.rating.count_sd price_sd size_sd
## 1 1 1 1 1
The variable “Primary Genre” is used because it is the key genre that each of these games is associated with. After grouping by primary genre, we found the average of each numerical variable: Average User Rating, User Rating Count, Price, and Size. We averaged because there is a lot of data for each genre, and we then scaled the averages so that each variable has mean 0 and standard deviation 1. Scaling is important before cluster analysis because otherwise variables measured on large scales, such as Size in bytes, would dominate the distances between genres.
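The scaling step can be sketched on two of the columns. This is a Python illustration rather than the R code used for the analysis; it uses only the six genre rows shown in the table above, and the helper name `standardize` is hypothetical. The sample standard deviation (n - 1 denominator) is used.

```python
from statistics import mean, stdev

def standardize(column):
    """Center a numeric column at 0 and scale it to unit sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

# Genre-level averages copied from the six rows shown in the table above.
size = [52586701, 136453120, 113516531, 76567793, 84826496, 106633216]
price = [0.0, 4.995, 2.1041304, 0.2161957, 17.49875, 0.0]

scaled = {"size": standardize(size), "price": standardize(price)}

for name, column in scaled.items():
    print(name, [round(x, 2) for x in column])
```

Before scaling, differences in Size are in the tens of millions while differences in Price are at most a few dollars, so unscaled Euclidean distances would be driven almost entirely by Size.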
## kcluster
## [1,] 80.0000000
## [2,] 53.3496139
## [3,] 36.2260487
## [4,] 26.4420736
## [5,] 16.3576898
## [6,] 10.7686989
## [7,] 8.4816453
## [8,] 6.4346684
## [9,] 4.5511644
## [10,] 3.2860759
## [11,] 2.5383748
## [12,] 1.8744334
## [13,] 1.4143411
## [14,] 1.1715535
## [15,] 0.8521516
We ran a for loop to find the best number of clusters, with nstart set to 20 because that seems like a stable number of random restarts. When we plot these values for the elbow method, we find that 6 is the best number of clusters because the value doesn’t decrease significantly after the cutoff at 6.
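The loop can be sketched as follows. This is a plain-Python illustration of the elbow method using Lloyd's algorithm with random restarts, not the R code used for the analysis; the data are invented, and the function names and the seed are hypothetical. The quantity tracked is the total within-cluster sum of squares.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def kmeans_wss(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm; returns the total within-cluster sum of squares."""
    centers = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous center.
        new_centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(sq_dist(p, centers[i]) for i, c in enumerate(clusters) for p in c)

def elbow(points, max_k, nstart=20, seed=0):
    """Best within-cluster sum of squares over nstart random starts, for each k."""
    rng = random.Random(seed)
    return {k: min(kmeans_wss(points, k, rng) for _ in range(nstart))
            for k in range(1, max_k + 1)}

# Invented data: three tight groups of four points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 10), (20, 0)]
          for dx, dy in offsets]

wss = elbow(points, max_k=5)
for k, value in wss.items():
    print(k, round(value, 2))
```

In data like this the curve should drop steeply up to the true number of groups and flatten afterwards; the elbow is read off at the point where adding another cluster stops paying for itself.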
## Joining, by = "label"
When we plotted the dendrogram, we found that the primary genre “Games” is not associated with any other primary genres. There are no primary genres that are similar to “Games”.
## # A tibble: 10 x 4
## Name `Primary Genre` `Average User Ratin~ `User Rating Cou~
## <chr> <chr> <dbl> <dbl>
## 1 "Clash of Clans" Games 4.5 3032734
## 2 "Clash Royale" Games 4.5 1277095
## 3 "PUBG MOBILE" Games 4.5 711409
## 4 "Plants vs. Zombies\\~ Games 4.5 469562
## 5 "Pok\\xe9mon GO" Games 3.5 439776
## 6 "Boom Beach" Games 4.5 400787
## 7 "Cash, Inc. Fame & Fo~ Games 5 374772
## 8 "Idle Miner Tycoon: C~ Games 4.5 283035
## 9 "TapDefense" Games 3.5 273687
## 10 "Star Wars\\u2122: Co~ Games 4.5 259030
This table shows that the top 10 popular games all tagged the genre “Games” as their primary genre.
Thanks to the analysis, we can conclude that size is our best predictor for Average User Rating. It had the strongest relationship, and in the plot with linear regression lines we saw that as the size of the game increases, the rating also increases, which will contribute to a game’s overall success. Initial price also doesn’t dictate average user rating as much as we thought, as many of the price points had the same average rating (around 4.5/5) until the price goes above 10 dollars. Even then, the ratings for those games average around 4/5.
Using multiple logistic regression, we were able to conclude that the best model to predict whether an application will be free is the sig.or or clean.fw model. Using the sig.or model, we obtained 79,656 true positives, 1,617 false negatives, 3,463 true negatives, and 8,416 false positives.
We can conclude that the primary genre “Games” is a very distinct genre. For a game to be recognized in the Apple App Store, its genre needs to be marked as “Games”. Since a lot of the popular games also tagged “Games” as their primary genre, we can assume that the genre “Games” is one of the many important factors that contribute to a popular strategy game.
Tristan. “17K Mobile Strategy Games.” Kaggle, 26 Aug. 2019, www.kaggle.com/tristan581/17k-apple-app-store-strategy-games.